18.03.2013
by Esa Turtiainen
tags: XSLT R

How to parse data from HTML?

I have often needed to pick some values out of an HTML page.

For example, suppose the information we want is in an HTML table:

<h1>Here is the price</h1>
<table><tr><td>1.0</td></tr></table>

What I am looking for is in the table that comes after the <h1> containing the text "Here is the price"; in this particular example, I am interested in the first cell of the table.

I want to retrieve the value 1.0 from the HTML above.

Regular expression matching is not a good way to find the right value. A regular expression does not understand the hierarchical nature of XML tags, so it can go badly wrong: a greedy pattern like <td>.*</td> happily matches from the first <td> to the last </td>, swallowing every cell in between.

It is better to do this with a tool that understands the hierarchical XML structure.

XSLT

I have tried to use XSLT (Extensible Stylesheet Language Transformations) for this purpose several times. It has always failed. I tried again today for several hours, but it turned out to be impossible, again.

As a warning, I explain here why this is so painful.

In theory it should work like this: I write a simple XSLT file that describes what to match using powerful XPath expressions, and when the XPath matches, I print what was found. I write a rule like:

<xsl:template match="//table[preceding-sibling::h1[1] = 'Here is the price']/tr/td">
  <xsl:value-of select="."/>
</xsl:template>

In an ideal world, this matches, anywhere in the document, the cells of a table that follows an h1 whose text content is "Here is the price". For each matching td we print the text of that node.

XPath can even index. It allows selecting the 5th row and the 4th column, or even the last column (last()).
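
For example, with hypothetical paths:

//table/tr[5]/td[4]        selects the 4th cell of the 5th row
//table/tr[5]/td[last()]   selects the last cell of the same row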

The rule above is valid XSLT. In an ideal world you could run it using the command:

xsltproc select.xslt target.html

And this nicely prints the wanted value.

But no, this has never worked. Today I spent several hours trying to understand why. I have a theory, but still not a complete answer.

I have a feeling that many people have tried this, but because there is hardly any discussion in the forums, they have failed without ever understanding why.

The most useful answer is on Stack Overflow, but it has only four votes. All the XSLT discussions are years old, so hopefully we are getting rid of XSLT.

The most important hint was that XHTML tags live in a separate namespace. You must prefix all HTML tag names with a prefix that is bound to that namespace in the stylesheet header. After that, the basic examples started to work, more or less.

Now our simple example looks like this (it only prints the text of h1 elements):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:template match="//xhtml:h1">
  <xsl:value-of select="."/>
</xsl:template>
</xsl:stylesheet>
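
For the original task, the rule would then become something like this (an untested sketch; note that every single element name needs the prefix):

<xsl:template match="//xhtml:table[preceding-sibling::xhtml:h1[1] = 'Here is the price']/xhtml:tr/xhtml:td">
  <xsl:value-of select="."/>
</xsl:template>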

And it takes much more code still to get the original task to work.

Then the next problem is that the default behavior of XSLT is completely stupid: for every part of the document that no template matches, it prints the text content, stripped of its tags. The only way I figured out to prevent that is that your templates must match every tag on the path from the root down to the tag you are looking for.

You must not use the handy "//" to match "any level" of tags.
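
In hindsight, there is also a standard trick that I did not find at the time: override the built-in rule for text nodes with an empty template, so that unmatched text is silently swallowed:

<!-- swallow every text node that no other template explicitly asks for -->
<xsl:template match="text()"/>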

To really understand why this is so painful, you should fully understand the XSLT execution model. It is confusing, and I have not fully understood it yet. Read this, and try to explain what "A node is processed by finding all the template rules with patterns that match the node, and choosing the best amongst them" means.

Then you have problems with entities like "&quot;" in the text; they cause all kinds of errors. You also have to figure out how to print a newline character to the output, and how to get rid of the XML header in the output.
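
For the record, the incantations that eventually handle the last two look roughly like this (a sketch, once you manage to dig them up):

<!-- output plain text: no XML header, no tags -->
<xsl:output method="text"/>

<!-- print a literal newline -->
<xsl:text>&#10;</xsl:text>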

This all means that I gave up. XSLT is not a reasonable way to do this.

There may be other XML tools that can evaluate XPath expressions, but I did not find any that looked at all well supported.

R

It turned out that there is a very simple way to do this, and it handles even quite complex cases very nicely.

The statistical programming language R has a library called XML that can sniff HTML tables directly from web pages. The tables come back as a list of R data frames (R's handy table format).

For example, these few lines of code retrieve all of today's hourly electricity prices from the German electricity exchange:

library(XML)  # provides readHTMLTable

# Read every table on the page into a list of data frames
t = readHTMLTable("http://www.epexspot.com/en/market-data")
DE = t[[6]][,9]                                 # 6th table, 9th column
DE.date = as.character(DE[1])                   # the first row is the date
DE = as.numeric(as.character(na.omit(DE[-1])))  # the rest are the prices

The latest German hourly prices are in the 6th table of the page, in the 9th column. The first row contains the date and the rest contain the prices. Every second row is empty and is read in as the value NA (not available); na.omit removes those. By default the prices are read not as numbers but as "factors" (roughly, enumerations), so they have to be converted to characters first and then to numbers.
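
To see why the double conversion is needed, here is a tiny demonstration with made-up values: as.numeric applied directly to a factor returns the internal level codes, not the printed values.

f = factor(c("10.5", "2.0"))
as.numeric(f)                # 1 2 -- the internal level codes
as.numeric(as.character(f))  # 10.5 2.0 -- the actual numbers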

Here is the electricity price for Germany, March 18, 2013.

[R plot of the hourly electricity price in Germany]