Simon Potter
simon.potter@auckland.ac.nz
Department of Statistics, University of Auckland
November 22, 2012
Abstract: The selectr package translates a CSS selector into an equivalent XPath expression. This allows the use of CSS selectors to query XML documents using the XML package. Convenience functions are also provided to mimic functionality present in modern web browsers.
When working with XML documents, a common task is to search for parts of a document that match a search query. For example, if we have a document representing a collection of books, we might want to search through it for a book matching a certain title or author. A language called XPath [1] has been created for constructing search queries on XML documents. XPath is capable of expressing complex search queries, but often at the cost of readability and terseness of the resulting expression.
An alternative way of searching for parts of a document is to use CSS selectors [2]. These are most commonly used in web browsers to apply styling information to components of a web page. The same language that selects which nodes to style in a web page can be used to select nodes in an XML document. This often produces more concise and readable queries than the equivalent XPath expression. It must be noted, however, that XPath expressions are more flexible than CSS selectors, so although all CSS selectors have an equivalent XPath expression, the reverse is not true.
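To illustrate the difference in conciseness, consider selecting all elements whose class attribute contains the word "test". The CSS selector is simply .test, while the equivalent XPath expression (taken from the translation table later in this article) is considerably longer:

*[@class and contains(concat(' ', normalize-space(@class), ' '), ' test ')]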
An advantage of using CSS selectors is that most people working with web documents such as HTML and SVG also know CSS. XPath is not employed anywhere beyond querying XML documents, so it is not a commonly known query language. Another important reason why CSS selectors are widely known is their common use in popular JavaScript libraries: jQuery [3] and D3 [4] are two examples of libraries that use CSS selectors, rather than XPath, to select the elements of a page to operate on. This is mostly due to the complexity of performing an XPath query in the browser, in addition to XPath's more verbose expressions. As an illustration, one might use CSS selectors to retrieve content with these libraries as follows:
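// Illustrative snippets (the selector is arbitrary): select all
// "p" elements that are direct children of a "body" element.
$("body > p");             // jQuery
d3.selectAll("body > p");  // D3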
The XML package [5] for R [6] is able to parse XML documents, which can then be queried using XPath. No facility exists in the XML package for using CSS selectors on XML documents. This limitation is due to the XML package's dependence on the libxml2 [7] library, which can only search using XPath. For the reasons mentioned above, it would be ideal to have the option of querying such documents with CSS selectors as well as XPath. If we can translate a CSS selector to XPath, then the restriction to XPath no longer applies and we can use a CSS selector wherever an XPath expression is required.
A mature Python package, cssselect [8], exists that performs translation of CSS selectors to XPath expressions. Unfortunately this package cannot be used from R because it would require Python to be present on a user's system, which cannot be guaranteed, particularly on Windows. The selectr package [9] is a translation of cssselect to R so that we have the same functionality within R as we would have using Python.
The rest of this article describes the process that selectr takes to translate CSS selectors to XPath expressions along with example usage.
The first step in translating any language to another is to tokenise an expression into the individual words, numbers, whitespace and symbols that represent its core structure. These pieces are called tokens. The following code shows the character representations of the tokens produced by tokenising a CSS selector expression:
R> tokenize("body > p")
[1] "<IDENT 'body' at 1>" [2] "<S ' ' at 5>" [3] "<DELIM '>' at 6>" [4] "<S ' ' at 7>" [5] "<IDENT 'p' at 8>" [6] "<EOF at 9>"
The selector body > p is a query that looks for all “p” elements within the document that are direct descendants of a “body” element. We can see that the selector has been tokenised into 6 tokens. Each token has the following structure: type, value, position. The type describes the kind of token: an identifier, whitespace, a number or a delimiter. The value is the actual text that a token represents, while the position is simply the position along the string at which the token was found.
Once we have the required tokens, it is necessary to parse them into a form that assigns meaning to the tokens. For example, in CSS a # preceding an identifier means that we are looking for an element with an ID matching that identifier. After parsing our tokens, we have an understanding of what the CSS selector means and therefore have the correct internal representation prior to translation to XPath. The following code shows what our example selector is understood to mean:
R> parse("body > p")
[1] "CombinedSelector[Element[body] > Element[p]]"
This shows that the selector is understood to be a combined selector that matches when a p element is a direct descendant of a body element. Once the parsing step is complete, it is necessary to translate this internal representation of a selector into its equivalent XPath expression.
XPath is a superset of the functionality of CSS selectors, so we can be certain that a mapping from CSS to XPath exists. Given that we already know the parsed structure of the selector, we work from the outer-most selector inwards. This means that with the parsed selector body > p we look at the CombinedSelector first, then the remaining Element components. In this case we know that the CombinedSelector is going to map to Element[body]/Element[p], which in turn produces body/p.
Some of these mappings are straightforward, as was the case in the given example, but others are more complex. The table below shows a sample of the translations that occur:
CSS Selector | Parsed Structure | XPath Expression
---|---|---
#test | Hash[Element[*]#test] | *[@id = 'test']
.test | Class[Element[*].test] | *[@class and contains(concat(' ', normalize-space(@class), ' '), ' test ')]
body p | CombinedSelector[Element[body] <followed> Element[p]] | body/descendant-or-self::*/p
a[title] | Attrib[Element[a][title]] | a[@title]
div[class^='btn'] | Attrib[Element[div][class ^= 'btn']] | div[@class and starts-with(@class, 'btn')]
li:nth-child(even) | Function[Element[li]:nth-child(['even'])] | */*[name() = 'li' and ((position() +0) mod 2 = 0 and position() >= 0)]
#outer-div :first-child | CombinedSelector[Hash[Element[*]#outer-div] <followed> Pseudo[Element[*]:first-child]] | *[@id = 'outer-div']/descendant-or-self::*/*[position() = 1]
These examples only touch on the possible translations, but they demonstrate that a mapping from CSS to XPath exists.
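For instance, the first row of the table can be reproduced directly using the css_to_xpath() function (described in more detail below), here supplying an empty prefix so that the bare XPath expression is returned:

R> library(selectr)
R> css_to_xpath("#test", prefix = "")

[1] "*[@id = 'test']"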
The selectr package becomes most useful when working with the XML package. Most commonly selectr is used to simplify the task of searching for a set of nodes. In the browser, there are two JavaScript functions that perform this task using CSS selectors, querySelector() and querySelectorAll() [10]. These functions are methods on a document or element object. Typical usage in the browser using JavaScript might be the following:
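// Illustrative example using the standard DOM methods:
// the first matching element, or null if there is no match
var first = document.querySelector("body > p");
// all matching elements, as a (possibly empty) NodeList
var all = document.querySelectorAll("body > p");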
Because these functions are so commonly used in popular JavaScript libraries, their behaviour has been mimicked in selectr. The selectr package provides the same functions but, instead of being methods on document or element objects, they are standalone functions. These functions typically take two parameters: the XML object to be searched, and the CSS selector to query with.
The difference between the two functions is that querySelector() will attempt to return the first matching node, or NULL in the case that no matches were found. querySelectorAll() will always return a list of matching nodes; this list will be empty when there are no matches. To demonstrate the usage of these functions, the following XML document will be used:
R> library(XML)
R> exdoc <- xmlRoot(xmlParse('<a><b class="aclass"/><c id="anid"/></a>'))
R> exdoc
<a>
  <b class="aclass"/>
  <c id="anid"/>
</a>
We will first see how querySelector() is used.
R> library(selectr)
R> querySelector(exdoc, "#anid") # Returns the matching node
<c id="anid"/>
R> querySelector(exdoc, ".aclass") # Returns the matching node
<b class="aclass"/>
R> querySelector(exdoc, "b, c") # First match from grouped selection
<b class="aclass"/>
R> querySelector(exdoc, "d") # No match
NULL
Now compare this to the results returned by querySelectorAll():
R> querySelectorAll(exdoc, "b, c") # Grouped selection
[[1]]
<b class="aclass"/>

[[2]]
<c id="anid"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "b") # A list of length one
[[1]]
<b class="aclass"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "d") # No match
list()
attr(,"class")
[1] "XMLNodeSet"
The main point to get across is that querySelector() returns a single node, while querySelectorAll() returns a list of nodes.
Both querySelector() and querySelectorAll() are paired with namespaced equivalents, querySelectorNS() and querySelectorAllNS() respectively. These functions will be demonstrated in more detail later in this article.
While the aforementioned functions are certainly useful, they do not cover all possible use cases. For other uses of CSS selectors, the css_to_xpath() function can be used wherever an XPath expression would normally be expected. The css_to_xpath() function has three parameters. The first is simply the CSS selector; the second is a prefix on the resulting XPath expression, which is useful when you already know some XPath and know where the selector should be scoped to. The final parameter determines the translator to use when translating selectors to XPath expressions. The generic translator is sufficient in most cases, except when (X)HTML is used; in those cases a translator that is aware of (X)HTML pseudo-selectors can be used instead. A case where css_to_xpath() may be used is when using XML's *apply functions, as shown below.
R> # Let's see all tag names present in the doc
R> xpathSApply(exdoc, css_to_xpath("*"), xmlName)
[1] "a" "b" "c"
R> # What is the value of the class attribute on all "b" elements?
R> xpathSApply(exdoc, css_to_xpath("b"),
R+             function(x) xmlGetAttr(x, "class"))
[1] "aclass"
Rather than simply returning nodes, we are processing each matched node with a given function from the XML package, while specifying the paths using CSS selectors.
While example usage of the selectr package has been demonstrated earlier, the real-world usage may not be clear, nor indeed the benefits over just using the XML package. To show how succinct it can be, we will try to create a data frame in R that lists the titles and URLs of technical reports hosted on the Department of Statistics Technical Report Blog, along with their publishing dates. First, let's examine part of the HTML that comprises the page to see how we're going to be selecting content.
...
<article>
  <header>
    <h1 class="entry-title">
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="Permalink to 2012-9 Writing grid Extensions"
         rel="bookmark">2012-9 Writing grid Extensions</a>
    </h1>
    <div class="entry-meta">
      <span class="sep">Posted on </span>
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="9:48 pm" rel="bookmark">
        <time class="entry-date" datetime="2012-11-06T21:48:17+00:00" pubdate>
          November 6, 2012
        </time>
      </a>
...
This fragment shows us that we have the information available to us; we just need to know how to query it. For example, we can see that the URL to each technical report is in the href attribute of an a element. In particular, this a element has an h1 parent with a class of entry-title. The a element also contains the title of the technical report. Similarly, we can see a time element that tells us, via the datetime attribute, when the post was published. We first start by loading the required packages and retrieving the data so that we can work with it.
R> library(XML)
R> library(selectr)
R> page <- htmlParse("http://stattech.wordpress.fos.auckland.ac.nz/")
Now that the page has been parsed into a queryable form, we can write the required CSS selectors to retrieve this information using querySelectorAll().
R> # CSS selector to get titles and URLs: "h1.entry-title > a"
R> links <- querySelectorAll(page, "h1.entry-title > a")
R> # Now let's get all of the publishing times
R> timeEls <- querySelectorAll(page, "time")
Now that we have gathered the correct elements, it is reasonably simple to manipulate them using the XML package. We want to extract the correct attributes and values from the selected nodes. The code below shows how we would do this.
R> # Collect all URLs
R> urls <- sapply(links, function(x) xmlGetAttr(x, "href"))
R> # Collect all titles
R> titles <- sapply(links, xmlValue)
R> # Collect all datetime attributes
R> dates <- sapply(timeEls, function(x) xmlGetAttr(x, "datetime"))
R> # To play nice with R, let's parse them as Dates
R> dates <- as.Date(dates)
R> # Create a data frame of the results
R> technicalReports <- data.frame(title = titles,
R+                                url = urls,
R+                                date = dates,
R+                                stringsAsFactors = FALSE)
R> # and show one column at a time
R> technicalReports$title
[1] "2012-9 Writing grid Extensions" [2] "2012-8 Meta-analysis of a rare-variant association test" [3] "2012-7 A Structured Approach for Generating SVG" [4] "2012-6 Working with the gridSVG Coordinate System" [5] "2012-5 Voronoi Treemaps in R" [6] "2012-4 Two-sample rank tests under complex sampling" [7] "2012-3 An empirical-process central limit theorem for complex sampling under bounds on the design effect" [8] "2012-2: Two-phase subsampling designs for genomic resequencing studies" [9] "2012-1: Partial Likelihood Ratio Tests for the Cox model under Complex Sampling"
R> technicalReports$url
[1] "http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/" [2] "http://stattech.wordpress.fos.auckland.ac.nz/2012-8-meta-analysis-of-a-rare-variant-association-test/" [3] "http://stattech.wordpress.fos.auckland.ac.nz/2012-7-a-structured-approach-for-generating-svg/" [4] "http://stattech.wordpress.fos.auckland.ac.nz/2012-6-working-with-the-gridsvg-coordinate-system/" [5] "http://stattech.wordpress.fos.auckland.ac.nz/voronoi-treemaps-in-r/" [6] "http://stattech.wordpress.fos.auckland.ac.nz/two-sample-rank-tests-under-complex-sampling/" [7] "http://stattech.wordpress.fos.auckland.ac.nz/an-empirical-process-central-limit-theorem-for-complex-sampling-under-bounds-on-the-design-effect/" [8] "http://stattech.wordpress.fos.auckland.ac.nz/2012-2-two-phase-subsampling-designs-for-genomic-resequencing-studies/" [9] "http://stattech.wordpress.fos.auckland.ac.nz/2012-1-partial-likelihood-ratio-tests-for-the-cox-model-under-complex-sampling/"
R> technicalReports$date
[1] "2012-11-06" "2012-11-04" "2012-10-15" "2012-10-10" "2012-09-19" [6] "2012-06-20" "2012-06-20" "2012-05-24" "2012-05-24"
An example (see “XPath”) written for the gridSVG package [11] will now be revisited. The example first shows a ggplot2 [12] plot that has been exported to SVG using gridSVG. The aim is then to remove the legend from the plot by removing the node containing all legend information. Once the node has been removed, the resulting document can be saved to produce an image with the legend removed.
What is of particular interest with this example is that it uses SVG, which is a namespaced XML document. This provides some challenges that require consideration, but the selectr package can handle this case.
R> library(ggplot2)
R> library(gridSVG)
R> qplot(mpg, wt, data = mtcars, colour = cyl)
R> svgdoc <- gridToSVG(name = NULL, "none", "none")$svg
So far we have simply reproduced the original plot and stored the resulting XML in a node tree called svgdoc. In order to remove the legend from the plot we first need to select the legend node from the SVG document. We will compare the XML-only approach with one enhanced with selectr. The comparison is shown below:
R> # XPath
R> legendNode <- getNodeSet(svgdoc,
R+                          "//svg:g[@id='layout::guide-box.3-5-3-5.1']",
R+                          c(svg = "http://www.w3.org/2000/svg"))[[1]]
R> # CSS
R> legendNode <- querySelector(svgdoc,
R+                             "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                             c(svg = "http://www.w3.org/2000/svg"),
R+                             prefix = "//svg:*/descendant-or-self::")
This particular example demonstrates a case where the XPath approach is more concise. This is because the id attribute that we are searching for needs to have its CSS selector escaped (due to : and . being special characters in CSS), while the XPath expression remains unchanged. Additionally, we also need to specify a namespace-aware prefix for the XPath that is generated. To use CSS selectors in this case required knowledge of XPath that we would rather avoid.
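Note that the doubled backslashes above are R string escapes; each \\ in the source produces a single backslash in the selector that selectr receives, as cat() shows:

R> cat("#layout\\:\\:guide-box\\.3-5-3-5\\.1")

#layout\:\:guide-box\.3-5-3-5\.1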
To work around this issue, a namespace-aware function should be used instead to abstract away the XPath-dependent code. The following code demonstrates the use of selectr's namespace-aware function querySelectorNS():
R> legendNode <- querySelectorNS(svgdoc,
R+                               "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                               c(svg = "http://www.w3.org/2000/svg"))
The resulting use of CSS selection is now as concise as the XPath version, with the only special consideration being the requirement of escaping the CSS selector.
Now that the legend has been selected, we can remove it from the SVG document to produce an image with a legend omitted.
R> removeChildren(xmlParent(legendNode), legendNode)
R> saveXML(svgdoc, file = NULL)
This article describes the new selectr package. Its main purpose is to allow the use of CSS selectors in a domain which previously only allowed XPath. In addition, convenience functions have also been described, allowing easy use of CSS selectors for the purpose of retrieving parts of an XML document. It has been demonstrated that the selectr package augments the XML package with the ability to use a more concise language for selecting content from an XML document.
This document is licensed under a Creative Commons Attribution 3.0 New Zealand License. The code is freely available under the GPL. The described functionality of selectr is present in version 0.2-0. selectr is available on CRAN and development occurs on GitHub at https://github.com/sjp/selectr.