When working with XML documents, a common task is searching for the parts of a document that match a query. For example, given a document representing a collection of books, we might want to search for a book matching a certain title or author. A language called XPath has been created for constructing search queries on XML documents. XPath is capable of expressing complex search queries, but this often comes at the cost of readability and terseness of the resulting expression.
An alternative way of searching for parts of a document is using CSS selectors. These are most commonly used in web browsers to apply styling information to components of a web page. The same language that selects which nodes to style in a web page can be used to select nodes in an XML document. This often produces more concise and readable queries than the equivalent XPath expression. It must be noted, however, that XPath expressions are more flexible than CSS selectors, so although all CSS selectors have an equivalent XPath expression, the reverse is not true.
The XML package for R is able to parse XML documents, which can then be queried using XPath. No facility exists in the XML package for using CSS selectors on XML documents. This limitation is due to the XML package's dependence on the libxml2 library, which can only search using XPath. For the reasons mentioned above, it would be ideal to have the option of querying such documents with CSS selectors as well as XPath. If we can translate a CSS selector to XPath, then the restriction to only using XPath no longer applies, and we can use a CSS selector wherever an XPath expression is required.
A mature Python package, cssselect, exists that performs translation of CSS selectors to XPath expressions. Unfortunately this package cannot be used from R because it would require Python to be present on a user's system, which cannot be guaranteed, particularly on Windows. The selectr package is a translation of cssselect to R, so that we have the same functionality within R as we would have using Python.
The rest of this article describes the process that selectr takes to translate CSS selectors to XPath expressions along with example usage.
The first step in translating one language to another is to tokenise an expression into the individual words, numbers, whitespace and symbols that represent its core structure. These pieces are called tokens. The following code shows the character representation of the tokens created by tokenising a CSS selector expression:
R> tokenize("body > p")
 "<IDENT 'body' at 1>"  "<S ' ' at 5>"  "<DELIM '>' at 6>"  "<S ' ' at 7>"  "<IDENT 'p' at 8>"  "<EOF at 9>"
body > p is a query that looks for all "p" elements within the document that are direct descendants of a "body" element. We can see that the selector has been tokenised into 6 tokens. Each token has the following structure: type, value, position. The type is the kind of token it is: an identifier, whitespace, a number or a delimiter. The value is the actual text that the token represents, while the position is simply the position along the string at which the token was found.
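To make the tokenisation step concrete, it can be sketched in Python. This is a toy tokenizer covering only the token types seen above (a real CSS tokenizer also handles numbers, strings, hashes and comments), and the names used are our own, not selectr's:

```python
import re

# Token patterns in priority order; just enough to tokenise "body > p".
TOKEN_PATTERNS = [
    ("IDENT", r"[A-Za-z][A-Za-z0-9-]*"),  # identifiers such as element names
    ("S",     r"[ \t\r\n]+"),             # whitespace
    ("DELIM", r"."),                      # any other single character
]

def tokenize(selector):
    """Return (type, value, position) triples, 1-indexed as in selectr."""
    tokens = []
    pos = 0
    while pos < len(selector):
        for ttype, pattern in TOKEN_PATTERNS:
            match = re.match(pattern, selector[pos:])
            if match:
                tokens.append((ttype, match.group(), pos + 1))
                pos += match.end()
                break
    tokens.append(("EOF", "", len(selector) + 1))
    return tokens
```

Applied to "body > p", this yields the same six tokens shown above, each as a (type, value, position) triple.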
Once we have the required tokens, it is necessary to parse them into a form that gives them meaning. For example, in CSS a # preceding an identifier means that we are looking for an element with an ID matching that identifier. After parsing our tokens, we have an understanding of what the CSS selector means and therefore have the correct internal representation prior to translation to XPath. The following code shows what our example selector is understood to mean:
R> parse("body > p")
 "CombinedSelector[Element[body] > Element[p]]"
This shows that the selector is understood to be a combined selector that matches when a p element is a direct descendant of a body element. Once the parsing step is complete, it is necessary to translate this internal representation of a selector into its equivalent XPath expression.
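The parsing step can also be sketched in Python, under the assumption of a heavily reduced grammar: only element names joined by the child combinator. selectr's real parser of course handles the full selector grammar:

```python
def parse(selector):
    """Parse element names joined by '>' into a string mirroring the
    nested shape of selectr's parse() output."""
    parts = [part.strip() for part in selector.split(">")]
    tree = "Element[%s]" % parts[0]
    for part in parts[1:]:
        tree = "CombinedSelector[%s > Element[%s]]" % (tree, part)
    return tree
```

Note how the nesting accumulates from left to right, so the outer-most object always represents the last combinator encountered.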
XPath is a superset of the functionality of CSS selectors, so we can ensure that a mapping from CSS to XPath exists. Given that we already know the parsed structure of the selector, we work from the outer-most selector inwards. This means that with the parsed body > p we look at the CombinedSelector first, then the Element components. In this case we know the CombinedSelector is going to map to Element[body]/Element[p], which in turn maps to the XPath expression body/p.
Some of these mappings are straightforward, as was the case in the given example, but others are more complex. The table below shows a sample of the translations that occur:
CSS Selector | Parsed Structure | XPath Expression
These examples only touch on the possible translations, but they demonstrate that a mapping from CSS to XPath exists.
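A minimal sketch of such a mapping, again in Python with hypothetical helper names (not selectr's actual code), might look like the following. The class translation mirrors the contains(concat(...)) idiom that CSS-to-XPath translators commonly emit, though selectr's real output may differ in detail:

```python
def css_to_xpath(selector, prefix="descendant-or-self::"):
    """Toy CSS-to-XPath translation for a few simple selector forms."""
    selector = selector.strip()
    if selector.startswith("#"):      # ID selector, e.g. #anid
        return prefix + "*[@id = '%s']" % selector[1:]
    if selector.startswith("."):      # class selector, e.g. .aclass
        return (prefix + "*[@class and contains(concat(' ', "
                "normalize-space(@class), ' '), ' %s ')]" % selector[1:])
    if ">" in selector:               # child combinator, e.g. body > p
        left, right = (part.strip() for part in selector.split(">", 1))
        return prefix + "%s/%s" % (left, right)
    return prefix + selector          # bare element name
```

The prefix parameter illustrates why translators expose one: it anchors the generated expression to a chosen starting point in the document.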
The selectr package becomes most useful when working with the XML package. Most commonly selectr is used to simplify the task of searching for a set of nodes. The package provides two functions for searching using CSS selectors, querySelector() and querySelectorAll(), named after the JavaScript functions of the same name found in web browsers. These functions are methods on a document or element object. The difference between the two functions is that querySelector() will attempt to return the first matching node, or NULL in the case that no matches were found, while querySelectorAll() will always return a list of matching nodes; this list will be empty when there are no matches. To demonstrate the usage of these functions, the following XML document will be used:
R> library(XML)
R> exdoc <- xmlRoot(xmlParse('<a><b class="aclass"/><c id="anid"/></a>'))
R> exdoc
<a>
 <b class="aclass"/>
 <c id="anid"/>
</a>
We will first see how querySelector() is used.
R> library(selectr)
R> querySelector(exdoc, "#anid") # Returns the matching node
<c id="anid"/>
R> querySelector(exdoc, ".aclass") # Returns the matching node
R> querySelector(exdoc, "b, c") # First match from grouped selection
R> querySelector(exdoc, "d") # No match
Now compare this to the results returned by querySelectorAll().
R> querySelectorAll(exdoc, "b, c") # Grouped selection
[[1]]
<b class="aclass"/>

[[2]]
<c id="anid"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "b") # A list of length one
[[1]]
<b class="aclass"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "d") # No match
list()
attr(,"class")
[1] "XMLNodeSet"
The main point to get across is that querySelector() returns a single node while querySelectorAll() returns a list of nodes. Both querySelector() and querySelectorAll() are paired with namespaced equivalents, querySelectorNS() and querySelectorAllNS() respectively. These functions will be demonstrated in more detail later in this article.
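The first-match-or-NULL versus list-of-matches semantics can be mimicked in Python with the standard library's xml.etree.ElementTree. This toy supports only tag, #id and .class selectors plus comma-grouping, and the function names are our own, not selectr's:

```python
import xml.etree.ElementTree as ET

def to_etree_path(simple):
    # Map one simple selector onto ElementTree's limited XPath subset.
    # (Class matching here is exact, unlike real CSS class matching.)
    if simple.startswith("#"):
        return ".//*[@id='%s']" % simple[1:]
    if simple.startswith("."):
        return ".//*[@class='%s']" % simple[1:]
    return ".//%s" % simple

def query_selector_all(root, selector):
    # A grouped selector ("b, c") is the union of each group's matches.
    paths = [to_etree_path(s.strip()) for s in selector.split(",")]
    return [el for path in paths for el in root.findall(path)]

def query_selector(root, selector):
    # First match, or None (the analogue of R's NULL) when nothing matches.
    matches = query_selector_all(root, selector)
    return matches[0] if matches else None

root = ET.fromstring('<a><b class="aclass"/><c id="anid"/></a>')
```

Run against the same example document, query_selector(root, "d") returns None while query_selector_all(root, "d") returns an empty list, matching the R behaviour described above.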
While the aforementioned functions are certainly useful, they do not cover all possible use cases. For other uses of CSS selectors, the css_to_xpath() function can be used wherever an XPath expression would normally be expected. The css_to_xpath() function has three parameters. The first parameter is simply the CSS selector, followed by a prefix on the resulting XPath expression. This prefix is useful in the circumstance when you already know some XPath and know where the selector should be scoped to. The final parameter determines the translator to use when translating selectors to XPath expressions. The generic translator is sufficient in most cases except when (X)HTML is used; in those cases a translator can be used that is aware of (X)HTML pseudo-selectors. One case where css_to_xpath() may be used is with the XML package's *apply functions, as shown below:
R> # Let's see all tag names present in the doc
R> xpathSApply(exdoc, css_to_xpath("*"), xmlName)
 "a" "b" "c"
R> # What is the value of the class attribute on all "b" elements?
R> xpathSApply(exdoc, css_to_xpath("b"),
R+             function(x) xmlGetAttr(x, "class"))
[1] "aclass"
Rather than returning nodes, we are processing each node using a given function from the XML package, but specifying paths using CSS selectors instead.
While example usage of the selectr package has been demonstrated above, the real-world usage may not be clear, nor indeed the benefits over just using the XML package. To show how succinct it can be, we will create a data frame in R that lists the titles and URLs of technical reports hosted on the Department of Statistics Technical Report Blog, along with their publishing dates. First, let's examine part of the HTML that comprises the page to see how we're going to be selecting content.
...
<article>
  <header>
    <h1 class="entry-title">
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="Permalink to 2012-9 Writing grid Extensions"
         rel="bookmark">2012-9 Writing grid Extensions</a>
    </h1>
    <div class="entry-meta">
      <span class="sep">Posted on </span>
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="9:48 pm" rel="bookmark">
        <time class="entry-date" datetime="2012-11-06T21:48:17+00:00" pubdate>
          November 6, 2012
        </time>
      </a>
...
This fragment shows that the information is available to us; we just need to know how to query it. For example, we can see that the URL to each technical report is in the href attribute of an a element. In particular, this a element has an h1 parent with a class of "entry-title". The same a element also contains the title of the technical report. Similarly, we can see a time element that tells us via its datetime attribute when the post was published. We first start by loading the required packages and retrieving the data so that we can work with it.
R> library(XML)
R> library(selectr)
R> page <- htmlParse("http://stattech.wordpress.fos.auckland.ac.nz/")
Now that the page has been parsed into a queryable form, we can write the required CSS selectors to retrieve this information:
R> # CSS selector to get titles and URLs: "h1.entry-title > a"
R> links <- querySelectorAll(page, "h1.entry-title > a")
R> # Now let's get all of the publishing times
R> timeEls <- querySelectorAll(page, "time")
Now that we have gathered the correct elements, it is reasonably simple to manipulate them using the XML package. We want to extract the correct attributes and values from the selected nodes. The code below shows how we would do this.
R> # Collect all URLs
R> urls <- sapply(links, function(x) xmlGetAttr(x, "href"))
R> # Collect all titles
R> titles <- sapply(links, xmlValue)
R> # Collect all datetime attributes
R> dates <- sapply(timeEls, function(x) xmlGetAttr(x, "datetime"))
R> # To play nice with R, let's parse it as a Date
R> dates <- as.Date(dates)
R> # Create a data frame of the results
R> technicalReports <- data.frame(title = titles,
R+                                url = urls,
R+                                date = dates,
R+                                stringsAsFactors = FALSE)
R> # and show one column at a time
R> technicalReports$title
 "2012-9 Writing grid Extensions"  "2012-8 Meta-analysis of a rare-variant association test"  "2012-7 A Structured Approach for Generating SVG"  "2012-6 Working with the gridSVG Coordinate System"  "2012-5 Voronoi Treemaps in R"  "2012-4 Two-sample rank tests under complex sampling"  "2012-3 An empirical-process central limit theorem for complex sampling under bounds on the design effect"  "2012-2: Two-phase subsampling designs for genomic resequencing studies"  "2012-1: Partial Likelihood Ratio Tests for the Cox model under Complex Sampling"
 "http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"  "http://stattech.wordpress.fos.auckland.ac.nz/2012-8-meta-analysis-of-a-rare-variant-association-test/"  "http://stattech.wordpress.fos.auckland.ac.nz/2012-7-a-structured-approach-for-generating-svg/"  "http://stattech.wordpress.fos.auckland.ac.nz/2012-6-working-with-the-gridsvg-coordinate-system/"  "http://stattech.wordpress.fos.auckland.ac.nz/voronoi-treemaps-in-r/"  "http://stattech.wordpress.fos.auckland.ac.nz/two-sample-rank-tests-under-complex-sampling/"  "http://stattech.wordpress.fos.auckland.ac.nz/an-empirical-process-central-limit-theorem-for-complex-sampling-under-bounds-on-the-design-effect/"  "http://stattech.wordpress.fos.auckland.ac.nz/2012-2-two-phase-subsampling-designs-for-genomic-resequencing-studies/"  "http://stattech.wordpress.fos.auckland.ac.nz/2012-1-partial-likelihood-ratio-tests-for-the-cox-model-under-complex-sampling/"
 "2012-11-06" "2012-11-04" "2012-10-15" "2012-10-10" "2012-09-19"  "2012-06-20" "2012-06-20" "2012-05-24" "2012-05-24"
An example (see "XPath") written for the gridSVG package will be revisited. The example first shows a ggplot2 plot that has been exported to SVG using gridSVG. The aim is then to remove the legend from the plot by removing the node containing all legend information. Once the node has been removed, the resulting document can be saved to produce an image with the legend removed.
What is of particular interest with this example is that it uses SVG, which is a namespaced XML document. This provides some challenges that require consideration, but the selectr package can handle this case.
R> library(ggplot2)
R> library(gridSVG)
R> qplot(mpg, wt, data = mtcars, colour = cyl)
R> svgdoc <- gridToSVG(name = NULL, "none", "none")$svg
So far we have simply reproduced the original plot and stored the resulting XML in a node tree called svgdoc. In order to remove the legend from the plot we first need to select the legend node from the SVG document. We will compare the XML-only approach with one enhanced with selectr. The comparison is shown below:
R> # XPath
R> legendNode <- getNodeSet(svgdoc,
R+     "//svg:g[@id='layout::guide-box.3-5-3-5.1']",
R+     c(svg = "http://www.w3.org/2000/svg"))[[1]]
R> # CSS
R> legendNode <- querySelector(svgdoc,
R+     "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+     c(svg = "http://www.w3.org/2000/svg"),
R+     prefix = "//svg:*/descendant-or-self::")
This particular example demonstrates a case where the XPath approach is more concise. This is because the id attribute that we're searching for needs to be escaped in the CSS selector (due to : and . being special characters in CSS), while the XPath expression remains unchanged. Additionally, we also need to specify a namespace-aware prefix for the XPath that is generated. To use CSS selectors in this case required knowledge of XPath that we would rather avoid.
To work around this issue, a namespace-aware function can be used instead to abstract away the XPath-dependent code. The following code demonstrates the use of selectr's querySelectorNS() function:
R> legendNode <- querySelectorNS(svgdoc,
R+     "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+     c(svg = "http://www.w3.org/2000/svg"))
The resulting use of CSS selection is now as concise as the XPath version, with the only special consideration being the requirement of escaping the CSS selector.
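The escaping itself is mechanical. A small Python sketch (a hypothetical helper, not part of selectr) shows the idea of backslash-escaping the characters that carry special meaning in CSS:

```python
import re

def css_escape(ident):
    # Backslash-escape characters that are special in CSS selectors,
    # so an id like "layout::guide-box.3-5-3-5.1" can be used after "#".
    return re.sub(r"([:.#\[\]>+~ ])", r"\\\1", ident)
```

For example, "#" + css_escape("layout::guide-box.3-5-3-5.1") produces the escaped selector used in the listing above.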
Now that the legend has been selected, we can remove it from the SVG document to produce an image with a legend omitted.
R> removeChildren(xmlParent(legendNode), legendNode)
R> saveXML(svgdoc, file = NULL)
This article has described the new selectr package. Its main purpose is to allow the use of CSS selectors in a domain which previously only allowed XPath. In addition, convenience functions have been described, allowing easy use of CSS selectors for the purpose of retrieving parts of an XML document. It has been demonstrated that the selectr package augments the XML package with the ability to use a more concise language for selecting content from an XML document.
This document is licensed under a Creative Commons Attribution 3.0 New Zealand License. The code is freely available under the GPL. The described functionality of selectr is present in version 0.2-0. selectr is available on CRAN and development occurs on GitHub at https://github.com/sjp/selectr.