Simon Potter simon.potter@auckland.ac.nz
Department of Statistics, University of Auckland
November 22, 2012
Abstract: The selectr package translates a CSS selector into an equivalent XPath expression. This allows the use of CSS selectors to query XML documents using the XML package. Convenience functions are also provided to mimic functionality present in modern web browsers.
When working with XML documents, a common task is searching for the parts of a document that match a query. For example, if we have a document representing a collection of books, we might want to search through it for a book matching a certain title or author. A language called XPath [1] has been created for constructing such search queries on XML documents. XPath is capable of expressing complex search queries, but this often comes at the cost of the readability and conciseness of the resulting expression.
An alternative way of searching for parts of a document is using CSS selectors [2]. These are most commonly used in web browsers to apply styling information to components of a web page. The same language that selects which nodes to style in a web page can be used to select nodes in an XML document. This often produces more concise and readable queries than the equivalent XPath expression. Note, however, that XPath expressions are more flexible than CSS selectors, so although every CSS selector has an equivalent XPath expression, the reverse is not true.
An advantage of using CSS selectors is that most people working with web documents such as HTML and SVG also know CSS. XPath is not employed anywhere beyond querying XML documents, so it is not a commonly known query language. Another reason CSS selectors are widely known is their common use in popular JavaScript libraries: jQuery [3] and D3 [4] are two examples of libraries that use CSS selectors, rather than XPath, to select the elements of a page they operate on. This is mostly due to the complexity of performing an XPath query in the browser, in addition to XPath's more verbose expressions. An example of how one would use CSS selectors to retrieve content using popular JavaScript libraries is the following code:
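Both libraries accept the same selector syntax; a minimal sketch (assuming jQuery and D3 have been loaded into the page, and browser-only) might be:

```javascript
// jQuery: select all <p> elements that are direct children of <body>
var paragraphs = $("body > p");

// D3: the equivalent selection, written with the same CSS selector
var selection = d3.selectAll("body > p");
```

In both cases the argument is a plain CSS selector string, with no XPath involved.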
The XML package [5] for R [6] is able to parse XML documents, which can then be queried using XPath. No facility exists in the XML package for using CSS selectors on XML documents. This limitation is due to the XML package's dependence on the libxml2 [7] library, which can only search using XPath. For the reasons mentioned above, it would be ideal to have the option of using CSS selectors as well as XPath to query such documents. If we can translate a CSS selector to XPath, then the restriction to XPath no longer applies, and we can use a CSS selector wherever an XPath expression is required.
A mature Python package, cssselect [8], exists that performs translation of CSS selectors to XPath expressions. Unfortunately this package cannot be used from R because it would require Python to be present on a user's system, which cannot be guaranteed, particularly on Windows. The selectr package [9] is a translation of cssselect to R, providing the same functionality within R as cssselect provides in Python.
The rest of this article describes the process that selectr takes to translate CSS selectors to XPath expressions along with example usage.
The first step in translating one language to another is to tokenise an expression into the individual words, numbers, whitespace and symbols that represent its core structure. These pieces are called tokens. The following code shows the character representations of the tokens created by tokenising a CSS selector expression:
R> tokenize("body > p")
[1] "<IDENT 'body' at 1>"
[2] "<S ' ' at 5>"
[3] "<DELIM '>' at 6>"
[4] "<S ' ' at 7>"
[5] "<IDENT 'p' at 8>"
[6] "<EOF at 9>"
The selector "body > p" is a query that looks for all "p" elements within the document that are also direct children of a "body" element. We can see that the selector has been tokenised into 6 tokens. Each token has the following structure: type, value, position. The type is the kind of token it is: an identifier, whitespace, a number or a delimiter. The value is the actual text that a token represents, while the position is simply the position along the string at which the token was found.
Once we have the required tokens, it is necessary
to parse these tokens into a form that applies
meaning to the tokens. For example, in CSS a #
preceding an identifier means that we are looking for an element
with an ID matching that identifier. After parsing our tokens, we
have an understanding of what the CSS selector means and therefore
have the correct internal representation prior to translation to
XPath. The following code shows what our example selector is
understood to mean:
R> parse("body > p")
[1] "CombinedSelector[Element[body] > Element[p]]"
This shows that the selector is understood to be a combined selector that matches when a "p" element is a direct child of a "body" element. Once the parsing step is complete, it is necessary to translate this internal representation of a selector into its equivalent XPath expression.
XPath is a superset of the functionality of CSS selectors, so we can be sure that a mapping from CSS to XPath exists. Given that we already know the parsed structure of the selector, we work from the outer-most selector inwards. This means that with the parsed selector "body > p" we look at the CombinedSelector first, then the remaining Element components. In this case we know that the CombinedSelector is going to map to Element[body]/Element[p], which in turn produces body/p.
Some of these mappings are straightforward as was the case in the given example, but others are more complex. The table below shows a sample of the translations that occur:
| CSS Selector | Parsed Structure | XPath Expression |
|---|---|---|
| #test | Hash[Element[*]#test] | *[@id = 'test'] |
| .test | Class[Element[*].test] | *[@class and contains(concat(' ', normalize-space(@class), ' '), ' test ')] |
| body p | CombinedSelector[Element[body] <followed> Element[p]] | body/descendant-or-self::*/p |
| a[title] | Attrib[Element[a][title]] | a[@title] |
| div[class^='btn'] | Attrib[Element[div][class ^= 'btn']] | div[@class and starts-with(@class, 'btn')] |
| li:nth-child(even) | Function[Element[li]:nth-child(['even'])] | */*[name() = 'li' and ((position() +0) mod 2 = 0 and position() >= 0)] |
| #outer-div :first-child | CombinedSelector[Hash[Element[*]#outer-div] <followed> Pseudo[Element[*]:first-child]] | *[@id = 'outer-div']/descendant-or-self::*/*[position() = 1] |
These examples only touch on the possible translations, but they demonstrate that a mapping from CSS to XPath exists.
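As a quick check, one row of the table can be reproduced with selectr's css_to_xpath() function (described in more detail below); passing an empty prefix here is simply a way of exposing the bare translation, since the function otherwise prepends a default prefix:

```r
library(selectr)

# Translate the attribute selector from the table above;
# an empty prefix suppresses the default "descendant-or-self::"
css_to_xpath("a[title]", prefix = "")
```

This should yield the XPath expression a[@title], matching the table.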
The selectr package becomes most useful when working with the XML package. Most commonly, selectr is used to simplify the task of searching for a set of nodes. In the browser, there are two JavaScript functions that perform this task using CSS selectors, querySelector() and querySelectorAll() [10]. These functions are methods on a document or element object. Typical usage in the browser using JavaScript might be the following:
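A minimal, browser-only sketch (the global document object is assumed):

```javascript
// Returns the first matching element, or null if there is no match
var firstParagraph = document.querySelector("body > p");

// Returns a (possibly empty) NodeList of all matching elements
var allParagraphs = document.querySelectorAll("body > p");
```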
Because these are so commonly used in popular JavaScript libraries, this behaviour has been mimicked in selectr. The selectr package also provides these functions, but instead of being methods on document or element objects, they are standalone functions. These functions take two parameters: the XML object to be searched, and the CSS selector to query with.
The difference between the two functions is that querySelector() will attempt to return the first matching node, or NULL in the case that no matches were found. querySelectorAll() will always return a list of matching nodes; this list will be empty when there are no matches. To demonstrate the usage of these functions, the following XML document will be used:
R> library(XML)
R> exdoc <- xmlRoot(xmlParse('<a><b class="aclass"/><c id="anid"/></a>'))
R> exdoc
<a>
 <b class="aclass"/>
 <c id="anid"/>
</a>
We will first see how querySelector() is used.
R> library(selectr)
R> querySelector(exdoc, "#anid") # Returns the matching node
<c id="anid"/>
R> querySelector(exdoc, ".aclass") # Returns the matching node
<b class="aclass"/>
R> querySelector(exdoc, "b, c") # First match from grouped selection
<b class="aclass"/>
R> querySelector(exdoc, "d") # No match
NULL
Now compare this to the results returned by querySelectorAll():
R> querySelectorAll(exdoc, "b, c") # Grouped selection
[[1]]
<b class="aclass"/>

[[2]]
<c id="anid"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "b") # A list of length one
[[1]]
<b class="aclass"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "d") # No match
list()
attr(,"class")
[1] "XMLNodeSet"
The main point to get across is that querySelector() returns a single node, while querySelectorAll() returns a list of nodes. Both querySelector() and querySelectorAll() are paired with namespaced equivalents, querySelectorNS() and querySelectorAllNS() respectively. These functions will be demonstrated in more detail later in this article.
While the aforementioned functions are certainly useful, they do not cover all possible use cases. For other uses of CSS selectors, the css_to_xpath() function can be used wherever an XPath expression would normally be expected. The css_to_xpath() function has three parameters. The first parameter is simply the CSS selector; the second is a prefix on the resulting XPath expression, which is useful when you already know some XPath and know where the selector should be scoped to. The final parameter determines the translator to use when translating selectors to XPath expressions. The generic translator is sufficient in most cases except when (X)HTML is used; in those cases a translator can be used that is aware of (X)HTML pseudo-selectors. A case where css_to_xpath() may be used is when using XML's *apply functions, as shown below.
R> # Let's see all tag names present in the doc
R> xpathSApply(exdoc, css_to_xpath("*"), xmlName)
[1] "a" "b" "c"
R> # What is the value of the class attribute on all "b" elements?
R> xpathSApply(exdoc, css_to_xpath("b"),
R+             function(x) xmlGetAttr(x, "class"))
[1] "aclass"
Rather than returning nodes, we are processing each node using a given function from the XML package, but specifying paths using CSS selectors instead.
While example usage of the selectr package has been demonstrated earlier, the real-world usage may not be clear, nor indeed the benefits over just using the XML package. To show how succinct it can be, we will try to create a data frame in R that lists the titles and URLs of technical reports hosted on the Department of Statistics Technical Report Blog, along with their publishing dates. First, let's examine part of the HTML that comprises the page to see how we're going to select content.
...
<article>
  <header>
    <h1 class="entry-title">
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="Permalink to 2012-9 Writing grid Extensions"
         rel="bookmark">2012-9 Writing grid Extensions</a>
    </h1>
    <div class="entry-meta">
      <span class="sep">Posted on </span>
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="9:48 pm" rel="bookmark">
        <time class="entry-date" datetime="2012-11-06T21:48:17+00:00" pubdate>
          November 6, 2012
        </time>
      </a>
...
This fragment shows us that we have the information available to us; we just need to know how to query it. For example, we can see that the URL to each technical report is in the href attribute of an a element. In particular, this a element has an h1 parent with a class of entry-title. The a element also contains the title of the technical report. Similarly, we can see a time element that tells us via the datetime attribute when the post was published. We first start by loading the required packages and retrieving the data so that we can work with it.
R> library(XML)
R> library(selectr)
R> page <- htmlParse("http://stattech.wordpress.fos.auckland.ac.nz/")
Now that the page has been parsed into a queryable form, we can write the required CSS selectors to retrieve this information using querySelectorAll().
R> # CSS selector to get titles and URLs: "h1.entry-title > a"
R> links <- querySelectorAll(page, "h1.entry-title > a")
R> # Now let's get all of the publishing times
R> timeEls <- querySelectorAll(page, "time")
Now that we have gathered the correct elements, it is reasonably simple to manipulate them using the XML package. We want to extract the correct attributes and values from the selected nodes. The code below shows how we would do this.
R> # Collect all URLs
R> urls <- sapply(links, function(x) xmlGetAttr(x, "href"))
R> # Collect all titles
R> titles <- sapply(links, xmlValue)
R> # Collect all datetime attributes
R> dates <- sapply(timeEls, function(x) xmlGetAttr(x, "datetime"))
R> # To play nice with R, let's parse it as a Date
R> dates <- as.Date(dates)
R> # Create a data frame of the results
R> technicalReports <- data.frame(title = titles,
R+                                url = urls,
R+                                date = dates,
R+                                stringsAsFactors = FALSE)
R> # and show one column at a time
R> technicalReports$title
[1] "2012-9 Writing grid Extensions"
[2] "2012-8 Meta-analysis of a rare-variant association test"
[3] "2012-7 A Structured Approach for Generating SVG"
[4] "2012-6 Working with the gridSVG Coordinate System"
[5] "2012-5 Voronoi Treemaps in R"
[6] "2012-4 Two-sample rank tests under complex sampling"
[7] "2012-3 An empirical-process central limit theorem for complex sampling under bounds on the design effect"
[8] "2012-2: Two-phase subsampling designs for genomic resequencing studies"
[9] "2012-1: Partial Likelihood Ratio Tests for the Cox model under Complex Sampling"
R> technicalReports$url
[1] "http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
[2] "http://stattech.wordpress.fos.auckland.ac.nz/2012-8-meta-analysis-of-a-rare-variant-association-test/"
[3] "http://stattech.wordpress.fos.auckland.ac.nz/2012-7-a-structured-approach-for-generating-svg/"
[4] "http://stattech.wordpress.fos.auckland.ac.nz/2012-6-working-with-the-gridsvg-coordinate-system/"
[5] "http://stattech.wordpress.fos.auckland.ac.nz/voronoi-treemaps-in-r/"
[6] "http://stattech.wordpress.fos.auckland.ac.nz/two-sample-rank-tests-under-complex-sampling/"
[7] "http://stattech.wordpress.fos.auckland.ac.nz/an-empirical-process-central-limit-theorem-for-complex-sampling-under-bounds-on-the-design-effect/"
[8] "http://stattech.wordpress.fos.auckland.ac.nz/2012-2-two-phase-subsampling-designs-for-genomic-resequencing-studies/"
[9] "http://stattech.wordpress.fos.auckland.ac.nz/2012-1-partial-likelihood-ratio-tests-for-the-cox-model-under-complex-sampling/"
R> technicalReports$date
[1] "2012-11-06" "2012-11-04" "2012-10-15" "2012-10-10" "2012-09-19"
[6] "2012-06-20" "2012-06-20" "2012-05-24" "2012-05-24"
An example (see “XPath”) written for the gridSVG package [11] will now be revisited. The example first shows a ggplot2 [12] plot that has been exported to SVG using gridSVG. The aim is then to remove the legend from the plot by removing the node containing all legend information. Once the node has been removed, the resulting document can be saved to produce an image with the legend removed.
What is of particular interest with this example is that it uses SVG, which is a namespaced XML document. This provides some challenges that require consideration, but the selectr package can handle this case.
R> library(ggplot2)
R> library(gridSVG)
R> qplot(mpg, wt, data=mtcars, colour=cyl)
R> svgdoc <- gridToSVG(name=NULL, "none", "none")$svg
So far we have simply reproduced the original plot and stored the resulting XML in a node tree called svgdoc. In order to remove the legend from the plot we first need to select the legend node from the SVG document. We will compare the XML-only approach with one enhanced with selectr. The comparison is shown below:
R> # XPath
R> legendNode <- getNodeSet(svgdoc,
R+                          "//svg:g[@id='layout::guide-box.3-5-3-5.1']",
R+                          c(svg = "http://www.w3.org/2000/svg"))[[1]]
R> # CSS
R> legendNode <- querySelector(svgdoc,
R+                             "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                             c(svg = "http://www.w3.org/2000/svg"),
R+                             prefix = "//svg:*/descendant-or-self::")
This particular example demonstrates a case where the XPath approach is more concise. This is because the id attribute that we're searching for needs to be escaped in the CSS selector (due to : and . being special characters in CSS), while the XPath expression remains unchanged. Additionally, we also need to specify a namespace-aware prefix for the generated XPath. To use CSS selectors in this case requires knowledge of XPath that we would rather avoid.
To work around this issue, a namespace-aware function should be used instead to abstract away the XPath-dependent code. The following code demonstrates the use of selectr's namespace-aware function querySelectorNS():
R> legendNode <- querySelectorNS(svgdoc,
R+                               "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                               c(svg = "http://www.w3.org/2000/svg"))
The resulting use of CSS selection is now as concise as the XPath version, with the only special consideration being the requirement of escaping the CSS selector.
Now that the legend has been selected, we can remove it from the SVG document to produce an image with a legend omitted.
R> removeChildren(xmlParent(legendNode), legendNode)
R> saveXML(svgdoc, file = NULL)
This article describes the new selectr package. Its main purpose is to allow the use of CSS selectors in a domain which previously only allowed XPath. In addition, convenience functions have been described that allow easy use of CSS selectors for retrieving parts of an XML document. It has been demonstrated that the selectr package augments the XML package with the ability to use a more concise language for selecting content from an XML document.
This document is licensed under a Creative Commons Attribution 3.0 New Zealand License. The code is freely available under the GPL. The described functionality of selectr is present in version 0.2-0. selectr is available on CRAN and development occurs on GitHub at https://github.com/sjp/selectr.