Simon Potter simon.potter@auckland.ac.nz
Department of Statistics, University of Auckland
November 22, 2012
Abstract: The selectr package translates a CSS selector into an equivalent XPath expression. This allows the use of CSS selectors to query XML documents using the XML package. Convenience functions are also provided to mimic functionality present in modern web browsers.
When working with XML documents, a common task is searching for the parts of a document that match a query. For example, if we have a document representing a collection of books, we might want to search through it for a book matching a certain title or author. A language called XPath [1] has been created for constructing such search queries on XML documents. XPath is capable of expressing complex search queries, but this often comes at the cost of the readability and conciseness of the resulting expression.
An alternative way of searching for parts of a document is using CSS selectors [2]. These are most commonly used in web browsers to apply styling information to components of a web page. The same language that selects which nodes to style in a web page can be used to select nodes in an XML document. This often produces more concise and readable queries than the equivalent XPath expression. Note, however, that XPath expressions are more flexible than CSS selectors, so although every CSS selector has an equivalent XPath expression, the reverse is not true.
An advantage of using CSS selectors is that most people working with web documents such as HTML and SVG also know CSS. XPath is not employed anywhere beyond querying XML documents, so it is not a commonly known query language. Another reason CSS selectors are widely known is their common use in popular JavaScript libraries: jQuery [3] and D3 [4] are two examples of libraries that use CSS selectors, rather than XPath, to select the elements of a page they operate on. This is mostly due to the complexity of performing an XPath query in the browser, in addition to XPath's more verbose expressions. An example of how one would use CSS selectors to retrieve content using popular JavaScript libraries is the following code:
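Both libraries accept the same selector syntax; a minimal sketch (assuming jQuery and D3 have been loaded into the page, and browser-only) might be:

```javascript
// jQuery: select all <p> elements that are direct children of <body>
var paragraphs = $("body > p");

// D3: the equivalent selection, written with the same CSS selector
var selection = d3.selectAll("body > p");
```

In both cases the argument is a plain CSS selector string, with no XPath involved.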
The XML package [5] for R [6] is able to parse XML documents, which can then be queried using XPath. No facility exists in the XML package for using CSS selectors on XML documents. This limitation is due to the XML package's dependence on the libxml2 [7] library, which can only search using XPath. For the reasons mentioned above, it would be ideal to have the option of using CSS selectors as well as XPath to query such documents. If we can translate a CSS selector to XPath, then the restriction to XPath no longer applies, and we can use a CSS selector wherever an XPath expression is required.
A mature Python package, cssselect [8], exists that performs translation of CSS selectors to XPath expressions. Unfortunately this package cannot be used from R because it would require Python to be present on a user's system, which cannot be guaranteed, particularly on Windows. The selectr package [9] is a translation of cssselect to R, providing the same functionality within R as cssselect provides in Python.
The rest of this article describes the process that selectr takes to translate CSS selectors to XPath expressions along with example usage.
The first step in translating one language to another is to tokenise an expression into the individual words, numbers, whitespace and symbols that represent its core structure. These pieces are called tokens. The following code shows the character representations of the tokens created by tokenising a CSS selector expression:
R> tokenize("body > p")
[1] "<IDENT 'body' at 1>"
[2] "<S ' ' at 5>"
[3] "<DELIM '>' at 6>"
[4] "<S ' ' at 7>"
[5] "<IDENT 'p' at 8>"
[6] "<EOF at 9>"
The selector "body > p" is a query that looks for all "p" elements within the document that are also direct children of a "body" element. We can see that the selector has been tokenised into 6 tokens. Each token has the following structure: type, value, position. The type is the kind of token it is: an identifier, whitespace, a number or a delimiter. The value is the actual text that a token represents, while the position is simply the position along the string at which the token was found.
Once we have the required tokens, it is necessary
to parse these tokens into a form that applies
meaning to the tokens. For example, in CSS a #
preceding an identifier means that we are looking for an element
with an ID matching that identifier. After parsing our tokens, we
have an understanding of what the CSS selector means and therefore
have the correct internal representation prior to translation to
XPath. The following code shows what our example selector is
understood to mean:
R> parse("body > p")
[1] "CombinedSelector[Element[body] > Element[p]]"
This shows that the selector is understood to be a combined selector that matches when a "p" element is a direct child of a "body" element. Once the parsing step is complete, it is necessary to translate this internal representation of a selector into its equivalent XPath expression.
XPath is a superset of the functionality of CSS selectors, so we can be sure that a mapping from CSS to XPath exists. Given that we already know the parsed structure of the selector, we work from the outer-most selector inwards. This means that with the parsed selector "body > p" we look at the CombinedSelector first, then the remaining Element components. In this case we know that the CombinedSelector is going to map to Element[body]/Element[p], which in turn produces body/p.
Some of these mappings are straightforward as was the case in the given example, but others are more complex. The table below shows a sample of the translations that occur:
| CSS Selector | Parsed Structure | XPath Expression |
|---|---|---|
| #test | Hash[Element[*]#test] | *[@id = 'test'] |
| .test | Class[Element[*].test] | *[@class and contains(concat(' ', normalize-space(@class), ' '), ' test ')] |
| body p | CombinedSelector[Element[body] <followed> Element[p]] | body/descendant-or-self::*/p |
| a[title] | Attrib[Element[a][title]] | a[@title] |
| div[class^='btn'] | Attrib[Element[div][class ^= 'btn']] | div[@class and starts-with(@class, 'btn')] |
| li:nth-child(even) | Function[Element[li]:nth-child(['even'])] | */*[name() = 'li' and ((position() +0) mod 2 = 0 and position() >= 0)] |
| #outer-div :first-child | CombinedSelector[Hash[Element[*]#outer-div] <followed> Pseudo[Element[*]:first-child]] | *[@id = 'outer-div']/descendant-or-self::*/*[position() = 1] |
These examples only touch on the possible translations, but they demonstrate that a mapping from CSS to XPath exists.
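As a quick check, one row of the table can be reproduced with selectr's css_to_xpath() function (described in more detail below); passing an empty prefix here is simply a way of exposing the bare translation, since the function otherwise prepends a default prefix:

```r
library(selectr)

# Translate the attribute selector from the table above;
# an empty prefix suppresses the default "descendant-or-self::"
css_to_xpath("a[title]", prefix = "")
```

This should yield the XPath expression a[@title], matching the table.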
The selectr package becomes most useful when working with the XML package. Most commonly, selectr is used to simplify the task of searching for a set of nodes. In the browser, there are two JavaScript functions that perform this task using CSS selectors, querySelector() and querySelectorAll() [10]. These functions are methods on a document or element object. Typical usage in the browser using JavaScript might be the following:
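A minimal, browser-only sketch (the global document object is assumed):

```javascript
// Returns the first matching element, or null if there is no match
var firstParagraph = document.querySelector("body > p");

// Returns a (possibly empty) NodeList of all matching elements
var allParagraphs = document.querySelectorAll("body > p");
```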
Because these are so commonly used in popular JavaScript libraries, this behaviour has been mimicked in selectr. The selectr package also provides these functions, but instead of being methods on document or element objects, they are standalone functions. These functions take two parameters: the XML object to be searched, and the CSS selector to query with.
The difference between the two functions is that querySelector() will attempt to return the first matching node, or NULL in the case that no matches were found. querySelectorAll() will always return a list of matching nodes; this list will be empty when there are no matches. To demonstrate the usage of these functions, the following XML document will be used:
R> library(XML)
R> exdoc <- xmlRoot(xmlParse('<a><b class="aclass"/><c id="anid"/></a>'))
R> exdoc
<a>
 <b class="aclass"/>
 <c id="anid"/>
</a>
We will first see how querySelector() is used.
R> library(selectr)
R> querySelector(exdoc, "#anid") # Returns the matching node
<c id="anid"/>
R> querySelector(exdoc, ".aclass") # Returns the matching node
<b class="aclass"/>
R> querySelector(exdoc, "b, c") # First match from grouped selection
<b class="aclass"/>
R> querySelector(exdoc, "d") # No match
NULL
Now compare this to the results returned by querySelectorAll():
R> querySelectorAll(exdoc, "b, c") # Grouped selection
[[1]]
<b class="aclass"/>

[[2]]
<c id="anid"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "b") # A list of length one
[[1]]
<b class="aclass"/>

attr(,"class")
[1] "XMLNodeSet"
R> querySelectorAll(exdoc, "d") # No match
list()
attr(,"class")
[1] "XMLNodeSet"
The main point to get across is that querySelector() returns a single node, while querySelectorAll() returns a list of nodes. Both querySelector() and querySelectorAll() are paired with namespaced equivalents, querySelectorNS() and querySelectorAllNS() respectively. These functions will be demonstrated in more detail later in this article.
While the aforementioned functions are certainly useful, they do not cover all possible use cases. For other uses of CSS selectors, the css_to_xpath() function can be used wherever an XPath expression would normally be expected. The css_to_xpath() function has three parameters. The first parameter is simply the CSS selector; the second is a prefix on the resulting XPath expression, which is useful when you already know some XPath and know where the selector should be scoped to. The final parameter determines the translator to use when translating selectors to XPath expressions. The generic translator is sufficient in most cases except when (X)HTML is used; in those cases a translator can be used that is aware of (X)HTML pseudo-selectors. A case where css_to_xpath() may be used is when using XML's *apply functions, as shown below.
R> # Let's see all tag names present in the doc
R> xpathSApply(exdoc, css_to_xpath("*"), xmlName)
[1] "a" "b" "c"
R> # What is the value of the class attribute on all "b" elements?
R> xpathSApply(exdoc, css_to_xpath("b"),
R+             function(x) xmlGetAttr(x, "class"))
[1] "aclass"
Rather than returning nodes, we are processing each node using a given function from the XML package, but specifying paths using CSS selectors instead.
While example usage of the selectr package has been demonstrated earlier, the real-world usage may not be clear, nor indeed the benefits over just using the XML package. To show how succinct it can be, we will try to create a data frame in R that lists the titles and URLs of technical reports hosted on the Department of Statistics Technical Report Blog, along with their publishing dates. First, let's examine part of the HTML that comprises the page to see how we're going to select content.
...
<article>
  <header>
    <h1 class="entry-title">
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="Permalink to 2012-9 Writing grid Extensions"
         rel="bookmark">2012-9 Writing grid Extensions</a>
    </h1>
    <div class="entry-meta">
      <span class="sep">Posted on </span>
      <a href="http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
         title="9:48 pm" rel="bookmark">
        <time class="entry-date" datetime="2012-11-06T21:48:17+00:00" pubdate>
          November 6, 2012
        </time>
      </a>
...
This fragment shows us that we have the information available to us; we just need to know how to query it. For example, we can see that the URL to each technical report is in the href attribute of an a element. In particular, this a element has an h1 parent with a class of entry-title. The a element also contains the title of the technical report. Similarly, we can see a time element that tells us via the datetime attribute when the post was published. We first start by loading the required packages and retrieving the data so that we can work with it.
R> library(XML)
R> library(selectr)
R> page <- htmlParse("http://stattech.wordpress.fos.auckland.ac.nz/")
Now that the page has been parsed into a queryable form, we can write the required CSS selectors to retrieve this information using querySelectorAll().
R> # CSS selector to get titles and URLs: "h1.entry-title > a"
R> links <- querySelectorAll(page, "h1.entry-title > a")
R> # Now let's get all of the publishing times
R> timeEls <- querySelectorAll(page, "time")
Now that we have gathered the correct elements, it is reasonably simple to manipulate them using the XML package. We want to extract the correct attributes and values from the selected nodes. The code below shows how we would do this.
R> # Collect all URLs
R> urls <- sapply(links, function(x) xmlGetAttr(x, "href"))
R> # Collect all titles
R> titles <- sapply(links, xmlValue)
R> # Collect all datetime attributes
R> dates <- sapply(timeEls, function(x) xmlGetAttr(x, "datetime"))
R> # To play nice with R, let's parse it as a Date
R> dates <- as.Date(dates)
R> # Create a data frame of the results
R> technicalReports <- data.frame(title = titles,
R+                                url = urls,
R+                                date = dates,
R+                                stringsAsFactors = FALSE)
R> # and show one column at a time
R> technicalReports$title
[1] "2012-9 Writing grid Extensions"
[2] "2012-8 Meta-analysis of a rare-variant association test"
[3] "2012-7 A Structured Approach for Generating SVG"
[4] "2012-6 Working with the gridSVG Coordinate System"
[5] "2012-5 Voronoi Treemaps in R"
[6] "2012-4 Two-sample rank tests under complex sampling"
[7] "2012-3 An empirical-process central limit theorem for complex sampling under bounds on the design effect"
[8] "2012-2: Two-phase subsampling designs for genomic resequencing studies"
[9] "2012-1: Partial Likelihood Ratio Tests for the Cox model under Complex Sampling"
R> technicalReports$url
[1] "http://stattech.wordpress.fos.auckland.ac.nz/2012-9-writing-grid-extensions/"
[2] "http://stattech.wordpress.fos.auckland.ac.nz/2012-8-meta-analysis-of-a-rare-variant-association-test/"
[3] "http://stattech.wordpress.fos.auckland.ac.nz/2012-7-a-structured-approach-for-generating-svg/"
[4] "http://stattech.wordpress.fos.auckland.ac.nz/2012-6-working-with-the-gridsvg-coordinate-system/"
[5] "http://stattech.wordpress.fos.auckland.ac.nz/voronoi-treemaps-in-r/"
[6] "http://stattech.wordpress.fos.auckland.ac.nz/two-sample-rank-tests-under-complex-sampling/"
[7] "http://stattech.wordpress.fos.auckland.ac.nz/an-empirical-process-central-limit-theorem-for-complex-sampling-under-bounds-on-the-design-effect/"
[8] "http://stattech.wordpress.fos.auckland.ac.nz/2012-2-two-phase-subsampling-designs-for-genomic-resequencing-studies/"
[9] "http://stattech.wordpress.fos.auckland.ac.nz/2012-1-partial-likelihood-ratio-tests-for-the-cox-model-under-complex-sampling/"
R> technicalReports$date
[1] "2012-11-06" "2012-11-04" "2012-10-15" "2012-10-10" "2012-09-19"
[6] "2012-06-20" "2012-06-20" "2012-05-24" "2012-05-24"
An example (see “XPath”) written for the gridSVG package [11] will now be revisited. The example first shows a ggplot2 [12] plot that has been exported to SVG using gridSVG. The aim is then to remove the legend from the plot by removing the node containing all legend information. Once the node has been removed, the resulting document can be saved to produce an image with the legend removed.
What is of particular interest with this example is that it uses SVG, which is a namespaced XML document. This provides some challenges that require consideration, but the selectr package can handle this case.
R> library(ggplot2)
R> library(gridSVG)
R> qplot(mpg, wt, data=mtcars, colour=cyl)
R> svgdoc <- gridToSVG(name=NULL, "none", "none")$svg
So far we have simply reproduced the original plot and stored the resulting XML in a node tree called svgdoc. In order to remove the legend from the plot we first need to select the legend node from the SVG document. We will compare the XML-only approach with one enhanced with selectr. The comparison is shown below:
R> # XPath
R> legendNode <- getNodeSet(svgdoc,
R+                          "//svg:g[@id='layout::guide-box.3-5-3-5.1']",
R+                          c(svg = "http://www.w3.org/2000/svg"))[[1]]
R> # CSS
R> legendNode <- querySelector(svgdoc,
R+                             "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                             c(svg = "http://www.w3.org/2000/svg"),
R+                             prefix = "//svg:*/descendant-or-self::")
This particular example demonstrates a case where the XPath approach is more concise. This is because the id attribute that we're searching for needs to be escaped in the CSS selector (due to : and . being special characters in CSS), while the XPath expression remains unchanged. Additionally, we also need to specify a namespace-aware prefix for the generated XPath. To use CSS selectors in this case requires knowledge of XPath that we would rather avoid.
To work around this issue, a namespace-aware function should be used instead to abstract away the XPath-dependent code. The following code demonstrates the use of selectr's namespace-aware function querySelectorNS():
R> legendNode <- querySelectorNS(svgdoc,
R+                               "#layout\\:\\:guide-box\\.3-5-3-5\\.1",
R+                               c(svg = "http://www.w3.org/2000/svg"))
The resulting use of CSS selection is now as concise as the XPath version, with the only special consideration being the requirement of escaping the CSS selector.
Now that the legend has been selected, we can remove it from the SVG document to produce an image with a legend omitted.
R> removeChildren(xmlParent(legendNode), legendNode)
R> saveXML(svgdoc, file = NULL)
This article describes the new selectr package. Its main purpose is to allow the use of CSS selectors in a domain which previously only allowed XPath. In addition, convenience functions have been described that allow easy use of CSS selectors for retrieving parts of an XML document. It has been demonstrated that the selectr package augments the XML package with the ability to use a more concise language for selecting content from an XML document.
This document is licensed under a Creative Commons Attribution 3.0 New Zealand License. The code is freely available under the GPL. The described functionality of selectr is present in version 0.2-0. selectr is available on CRAN and development occurs on GitHub at https://github.com/sjp/selectr.