Simon Potter
simon.potter@auckland.ac.nz
Department of Statistics, University of Auckland
November 22, 2012
Abstract: The selectr package translates a CSS selector into an equivalent XPath expression. This allows the use of CSS selectors to query XML documents using the XML package. Convenience functions are also provided to mimic functionality present in modern web browsers.
When working with XML documents, a common task is to search for the parts of a document that match a search query. For example, if we have a document representing a collection of books, we might want to search it for a book matching a certain title or author. XPath [1] is a language that was created for constructing such search queries on XML documents. XPath is capable of expressing complex search queries, but this often comes at the cost of readability and terseness in the resulting expression.
An alternative way of searching for parts of a document is using CSS selectors [2]. These are most commonly used in web browsers to apply styling information to components of a web page. We can use the same language that selects which nodes to style in a web page to select nodes in an XML document. This often produces more concise and readable queries than the equivalent XPath expression. It must be noted however that XPath expressions are more flexible than CSS selectors, so although all CSS selectors have an equivalent XPath expression, the reverse is not true.
An advantage of using CSS selectors is that most people working with web documents such as HTML and SVG also know CSS. XPath is not employed anywhere beyond querying XML documents, so it is not a commonly known query language. Another important reason why CSS selectors are widely known is their common use in popular JavaScript libraries. jQuery [3] and D3 [4] are two examples that select elements of a page using CSS selectors, rather than XPath, before performing operations on them. This is mostly due to the complexity of performing an XPath query in the browser, in addition to the more verbose expressions. The following code shows how one would use CSS selectors to retrieve content with these popular JavaScript libraries:
JS> // jQuery
JS> $("#listing code");
JS> // D3
JS> d3.selectAll("#listing code");
JS> // The equivalent XPath expression
JS> "descendant-or-self::*[@id = 'listing']/descendant-or-self::*/code"
Each of these selects code elements that are descendants of the element with the ID “listing”.
The XML package [5] for R [6] is able to parse XML documents, which can later be queried using XPath. No facility exists in the XML package for using CSS selectors on XML documents. This limitation is due to the XML package's dependence on the libxml2 [7] library, which can only search using XPath. For the reasons mentioned above, it would be ideal if we had the option of using CSS selectors as well as XPath to query such documents. If we can translate a CSS selector to XPath, then the restriction to only using XPath no longer applies, and we can therefore use a CSS selector wherever an XPath expression is required.
A mature Python package exists that performs translation of CSS selectors to XPath expressions. Unfortunately this package, cssselect [8], cannot be used in R because it would require Python to be present on a user's system which cannot be guaranteed, particularly on Windows. The selectr package [9] is a translation of cssselect to R so that we have the same functionality within R as we would have using Python.
The rest of this article describes the process that selectr takes to translate CSS selectors to XPath expressions along with example usage.
The first step in translating one language to another is to tokenise an expression into the individual words, numbers, whitespace and symbols that represent its core structure. These pieces are called tokens. The following code shows the character representations of the tokens created by tokenising a CSS selector expression:
R> tokenize("body > p")
[1] "<IDENT 'body' at 1>"
[2] "<S ' ' at 5>"
[3] "<DELIM '>' at 6>"
[4] "<S ' ' at 7>"
[5] "<IDENT 'p' at 8>"
[6] "<EOF at 9>"
The selector body > p is a query that looks for all “p” elements within the document that are also direct descendants of a “body” element. We can see that the selector has been tokenised into 6 tokens. Each token has the following structure: type, value, position. The type describes the kind of token: an identifier, whitespace, a number or a delimiter. The value is the actual text that a token represents, while the position is simply the position along the string at which the character was found.
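To see a wider range of token types, we can tokenise a richer selector. The selector below is a hypothetical example (not taken from the package documentation); its tokens follow the same type, value, position pattern shown above:

R> # An ID (hash), a class, a delimiter and a function all appear here
R> tokenize("div#main > p.note:nth-child(2)")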
    
Once we have the required tokens, it is necessary to parse them into a form that gives them meaning. For example, in CSS a # preceding an identifier means that we are looking for an element with an ID matching that identifier. After parsing our tokens, we have an understanding of what the CSS selector means and therefore have the correct internal representation prior to translation to XPath. The following code shows what our example selector is understood to mean:
R> parse("body > p")
[1] "CombinedSelector[Element[body] > Element[p]]"
This shows that the selector is understood to be a combined selector that matches when a p element is a direct descendant of a body element. Once the parsing step is complete, it is necessary to translate this internal representation of a selector into its equivalent XPath expression.
XPath is a superset of the functionality of CSS selectors, so we can ensure that there is a mapping from CSS to XPath. Given that we already know the parsed structure of the selector, we work from the outer-most selector inwards. This means that with the parsed selector body > p we look at the CombinedSelector first, then the remaining Element components. In this case we know that the CombinedSelector is going to map to Element[body]/Element[p], which in turn produces body/p.
Some of these mappings are straightforward as was the case in the given example, but others are more complex. The table below shows a sample of the translations that occur:
| CSS Selector | Parsed Structure | XPath Expression |
|---|---|---|
| #test | Hash[Element[*]#test] | *[@id = 'test'] |
| .test | Class[Element[*].test] | *[@class and contains(concat(' ', normalize-space(@class), ' '), ' test ')] |
| body p | CombinedSelector[Element[body] <followed> Element[p]] | body/descendant-or-self::*/p |
| a[title] | Attrib[Element[a][title]] | a[@title] |
| div[class^='btn'] | Attrib[Element[div][class ^= 'btn']] | div[@class and starts-with(@class, 'btn')] |
| li:nth-child(even) | Function[Element[li]:nth-child(['even'])] | */*[name() = 'li' and ((position() +0) mod 2 = 0 and position() >= 0)] |
| #outer-div :first-child | CombinedSelector[Hash[Element[*]#outer-div] <followed> Pseudo[Element[*]:first-child]] | *[@id = 'outer-div']/descendant-or-self::*/*[position() = 1] |
These examples only touch on the possible translations, but they demonstrate that a mapping from CSS to XPath exists.
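These mappings can be inspected directly using the css_to_xpath() function, described later in this article. As a sketch, translating the first selector in the table with an empty prefix (so that the output matches the bare table entry) might look like the following:

R> # Translate a CSS ID selector; an empty prefix keeps the bare expression
R> css_to_xpath("#test", prefix = "")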
The selectr package becomes most useful when working with the XML package. Most commonly, selectr is used to simplify the task of searching for a set of nodes. In the browser, there are two JavaScript functions that perform this task using CSS selectors: querySelector() and querySelectorAll() [10]. These functions are methods on a document or element object. Typical usage in the browser using JavaScript might be the following:
    
JS> document.querySelector("ul li.active");
JS> document.querySelectorAll("p > a.info");
Because these functions are so commonly used in popular JavaScript libraries, their behaviour has been mimicked in selectr. The selectr package also provides these functions, but instead of being methods on document or element objects, they are standalone functions. These functions typically take two parameters: the XML object to be searched, and the CSS selector to query with.
The difference between the two functions is that querySelector() will attempt to return the first matching node, or NULL in the case that no matches were found. querySelectorAll() will always return a list of matching nodes; this list will be empty when there are no matches. To demonstrate the usage of these functions, the following XML document will be used:
    
R> library(XML)
R> exdoc <- xmlRoot(xmlParse('<a><b class="aclass"/><c id="anid"/></a>'))
R> exdoc
<a>
 <b class="aclass"/>
 <c id="anid"/>
</a>
We will first see how querySelector() is used.
R> library(selectr)
R> querySelector(exdoc, "#anid") # Returns the matching node
<c id="anid"/>
R> querySelector(exdoc, ".aclass") # Returns the matching node
<b class="aclass"/>
R> querySelector(exdoc, "b, c") # First match from grouped selection
<b class="aclass"/>
R> querySelector(exdoc, "d") # No match
NULL
      Now compare this to the results returned by
      querySelectorAll():
    
R> querySelectorAll(exdoc, "b, c") # Grouped selection
[[1]]
<b class="aclass"/>

[[2]]
<c id="anid"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "b") # A list of length one
[[1]]
<b class="aclass"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "d") # No match
list()
attr(,"class")
[1] "XMLNodeSet"
The main point to get across is that querySelector() returns a node, while querySelectorAll() returns a list of nodes.
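Since querySelector() may return NULL, code that uses its result should guard against the no-match case; a minimal sketch:

R> el <- querySelector(exdoc, "d")
R> # Guard against the no-match case before operating on the node
R> if (is.null(el)) {
R+   message("No matching node was found")
R+ }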
    
Both querySelector() and querySelectorAll() are paired with namespaced equivalents, querySelectorNS() and querySelectorAllNS() respectively. These functions will be demonstrated in more detail later in this article.
    
While the aforementioned functions are certainly useful, they do not cover all possible use cases. For other uses of CSS selectors, the css_to_xpath() function can be used wherever an XPath expression would normally be expected. The css_to_xpath() function has three parameters. The first parameter is simply the CSS selector; the second is a prefix on the resulting XPath expression. This prefix is useful when you already know some XPath and know where the selector should be scoped to. The final parameter determines the translator to use when translating selectors to XPath expressions. The generic translator is sufficient in most cases, except when (X)HTML is used; in those cases a translator that is aware of (X)HTML pseudo-selectors can be used. A case where css_to_xpath() may be used is with XML's *apply functions, as shown below.
    
R> # Let's see all tag names present in the doc
R> xpathSApply(exdoc, css_to_xpath("*"), xmlName)
[1] "a" "b" "c"
R> # What is the value of the class attribute on all "b" elements?
R> xpathSApply(exdoc, css_to_xpath("b"),
R+             function(x) xmlGetAttr(x, "class"))
[1] "aclass"
Rather than returning nodes, we are processing each node using a given function from the XML package, but specifying paths using CSS selectors instead.
While example usage of the selectr package has been demonstrated earlier, its real-world usage may not be clear, nor indeed its benefits over just using the XML package. To show how succinct it can be, we will try to create a data frame in R that lists the titles and URLs of technical reports hosted on the Department of Statistics Technical Report Blog, along with their publishing dates. First, let's examine part of the HTML that comprises the page to see how we're going to select content.
...
<article>
  <header>
    <h1 class="entry-title">
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="Permalink to 2012-9 Writing grid Extensions"
         rel="bookmark">2012-9 Writing grid Extensions</a>
    </h1>
    <div class="entry-meta">
      <span class="sep">Posted on </span>
        <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
           title="9:48 pm"
           rel="bookmark">
          <time class="entry-date"
                datetime="2012-11-06T21:48:17+00:00" pubdate>
            November 6, 2012
          </time>
        </a>
...
    
This fragment shows us that the information is available to us; we just need to know how to query it. For example, we can see that the URL to each technical report is in the href attribute of an a element. In particular, this a element has an h1 parent with a class of entry-title. The a element also contains the title of the technical report. Similarly, we can see a time element that tells us via the datetime attribute when the post was published. We first start by loading the required packages and retrieving the data so that we can work with it.
    
R> library(XML)
R> library(selectr)
R> page <- htmlParse("http://stattech.wordpress.fos.auckland.ac.nz/")
Now that the page has been parsed into a queryable form, we can write the required CSS selectors to retrieve this information using querySelectorAll().
R> # CSS selector to get titles and URLs: "h1.entry-title > a"
R> links <- querySelectorAll(page, "h1.entry-title > a")
R> # Now let's get all of the publishing times
R> timeEls <- querySelectorAll(page, "time")
Now that we have gathered the correct elements, it is reasonably simple to manipulate them using the XML package. We want to extract the correct attributes and values from the selected nodes. The code below shows how we would do this.
R> # Collect all URLs
R> urls <- sapply(links, function(x) xmlGetAttr(x, "href"))
R> # Collect all titles
R> titles <- sapply(links, xmlValue)
R> # Collect all datetime attributes
R> dates <- sapply(timeEls, function(x) xmlGetAttr(x, "datetime"))
R> # To play nice with R, let's parse it as a Date
R> dates <- as.Date(dates)
R> # Create a data frame of the results
R> technicalReports <- data.frame(title = titles,
R+                                url = urls,
R+                                date = dates,
R+                                stringsAsFactors = FALSE)
R> # and show one column at a time
R> technicalReports$title
[1] "2012-9 Writing grid Extensions"
[2] "2012-8 Meta-analysis of a rare-variant association test"
[3] "2012-7 A Structured Approach for Generating SVG"
[4] "2012-6 Working with the gridSVG Coordinate System"
[5] "2012-5 Voronoi Treemaps in R"
[6] "2012-4 Two-sample rank tests under complex sampling"
[7] "2012-3 An empirical-process central limit theorem for complex sampling under bounds on the design effect"
[8] "2012-2: Two-phase subsampling designs for genomic resequencing studies"
[9] "2012-1: Partial Likelihood Ratio Tests for the Cox model under Complex Sampling"
R> technicalReports$url
[1] "http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
[2] "http://stattech.wordpress.fos.auckland.ac.nz/2012-8-meta-analysis-of-a-rare-variant-association-test/"
[3] "http://stattech.wordpress.fos.auckland.ac.nz/2012-7-a-structured-approach-for-generating-svg/"
[4] "http://stattech.wordpress.fos.auckland.ac.nz/2012-6-working-with-the-gridsvg-coordinate-system/"
[5] "http://stattech.wordpress.fos.auckland.ac.nz/voronoi-treemaps-in-r/"
[6] "http://stattech.wordpress.fos.auckland.ac.nz/two-sample-rank-tests-under-complex-sampling/"
[7] "http://stattech.wordpress.fos.auckland.ac.nz/an-empirical-process-central-limit-theorem-for-complex-sampling-under-bounds-on-the-design-effect/"
[8] "http://stattech.wordpress.fos.auckland.ac.nz/2012-2-two-phase-subsampling-designs-for-genomic-resequencing-studies/"
[9] "http://stattech.wordpress.fos.auckland.ac.nz/2012-1-partial-likelihood-ratio-tests-for-the-cox-model-under-complex-sampling/"
R> technicalReports$date
[1] "2012-11-06" "2012-11-04" "2012-10-15" "2012-10-10" "2012-09-19"
[6] "2012-06-20" "2012-06-20" "2012-05-24" "2012-05-24"
An example (see “XPath”) written for the gridSVG package [11] will be revisited. The example first shows a ggplot2 [12] plot that has been exported to SVG using gridSVG. The aim is to then remove the legend from the plot by removing the node containing all legend information. Once the node has been removed, the resulting document can be saved to produce an image with a legend removed.
What is of particular interest with this example is that it uses SVG, which is a namespaced XML document. This provides some challenges that require consideration, but the selectr package can handle this case.
R> library(ggplot2)
R> library(gridSVG)
R> qplot(mpg, wt, data=mtcars, colour=cyl)
R> svgdoc <- gridToSVG(name=NULL, "none", "none")$svg
So far we have simply reproduced the original plot and stored the resulting XML in a node tree called svgdoc. In order to remove the legend from the plot, we first need to select the legend node from the SVG document. We will compare the XML-only approach with one enhanced with selectr. The comparison is shown below:
R> # XPath
R> legendNode <- getNodeSet(svgdoc,
R+                 "//svg:g[@id='layout::guide-box.3-5-3-5.1']",
R+                 c(svg = "http://www.w3.org/2000/svg"))[[1]]
R> # CSS
R> legendNode <- querySelector(svgdoc,
R+                 "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                 c(svg = "http://www.w3.org/2000/svg"),
R+                 prefix = "//svg:*/descendant-or-self::")
This particular example demonstrates a case where the XPath approach is more concise. This is because the id attribute that we're searching for needs to have its CSS selector escaped (due to : and . being special characters in CSS), while the XPath expression remains unchanged. Additionally, we also need to specify a namespace-aware prefix for the XPath that is generated. Using CSS selectors in this case required knowledge of XPath that we would rather avoid.
    
To work around this issue, a namespace-aware function should be used instead to abstract away the XPath-dependent code. The following code demonstrates the use of selectr's namespace-aware function querySelectorNS():
R> legendNode <- querySelectorNS(svgdoc,
R+                 "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                 c(svg = "http://www.w3.org/2000/svg"))
The resulting use of CSS selection is now as concise as the XPath version, with the only special consideration being the requirement of escaping the CSS selector.
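As a sketch of the escaping requirement in isolation, the escaped selector can also be passed to css_to_xpath() to inspect the generated expression; note the doubled backslashes required by R string literals:

R> # Each CSS special character (: and .) is escaped with a backslash,
R> # which must itself be doubled inside an R string
R> css_to_xpath("#layout\\:\\:guide-box\\.3-5-3-5\\.1")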
Now that the legend has been selected, we can remove it from the SVG document to produce an image with a legend omitted.
R> removeChildren(xmlParent(legendNode), legendNode)
R> saveXML(svgdoc, file = NULL)
This article describes the new selectr package. Its main purpose is to allow the use of CSS selectors in a domain which previously only allowed XPath. In addition, convenience functions have also been described, allowing easy use of CSS selectors for the purpose of retrieving parts of an XML document. It has been demonstrated that the selectr package augments the XML package with the ability to use a more concise language for selecting content from an XML document.
This document is licensed under a Creative Commons Attribution 3.0 New Zealand License. The code is freely available under the GPL. The described functionality of selectr is present in version 0.2-0. selectr is available on CRAN and development occurs on GitHub at https://github.com/sjp/selectr.