05.10.2014
by Esa Turtiainen
tags: R Javascript HTML

Some sources that I want to read easily with R have switched to Javascript-generated tables. That means the data I am looking for is no longer readily available in the HTML source.

The old trick:

t = readHTMLTable("http://www.epexspot.com/en/market-data")
DE = t[[6]][,9]

does not work any more for such pages. Actually, this particular data source still works, but some others do not. The code reads the latest German electricity prices (all hours) from a page that shows the prices for the latest week in one table. The table is generated in the epexspot back end, so it is already present in the HTML code.

What is more common nowadays is that the table is generated by Javascript in the viewer’s browser. The information is not there until the Javascript has been evaluated.
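A quick way to tell which case a page falls into is to look at the raw source before any Javascript has run. The sketch below uses two made-up HTML strings (not taken from any real site) and a hypothetical helper, has_static_table. It only looks for a <table> tag, so it is a rough heuristic rather than a real parse.

```r
# Hypothetical helper: TRUE if the raw HTML text already contains a <table> tag.
has_static_table <- function(html) {
  grepl("<table[ >]", html, ignore.case = TRUE)
}

# Made-up example: the table is generated on the server and is in the source.
static_page <- "<html><body><table><tr><td>28.34</td></tr></table></body></html>"

# Made-up example: only a placeholder and a script; the table would appear
# only after a browser has run the script.
js_page <- "<html><body><div id='prices'></div><script>buildTable('prices');</script></body></html>"

print(has_static_table(static_page))
print(has_static_table(js_page))
```

For a real page, the raw source could be read with something like paste(readLines(url, warn = FALSE), collapse = "\n") and passed to the same helper.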

It may be that the information is hidden intentionally. But what the user sees in the browser cannot be hidden.

There are three alternatives for getting the information:

  1. Reverse engineer the Javascript and find out how it gets the primary information
  2. Evaluate the Javascript using a Javascript engine
  3. Remote control a full browser and ask it what HTML is shown to the user after the Javascript is evaluated

I have selected the third approach.

The Javascript program rewrites the HTML into a form that contains the table. You need to remote control the browser so that it gives you the HTML after the Javascript has been evaluated and has rewritten the page.

Remote controlled browsers are widely used for testing. Therefore, there are well developed systems to programmatically control a running browser. One such framework is Selenium.

http://www.seleniumhq.org/

Installing Relenium

Relenium is an R library for using Selenium.

http://lluisramon.github.io/relenium/

To use it, we need an impressive stack of software:

  1. Relenium, the interface library in R
  2. rJava, the R interface to Java, because Selenium is written in Java
  3. The Selenium software
  4. WebDriver, the part of Selenium that interacts with Firefox (a World Wide Web Consortium standard)
  5. The Firefox browser

I won’t go through all the dependency and version problems I had during installation. Just a couple of them deserve mention.

First of all, the Selenium that is currently available in the repositories works only with Firefox 30, while the current version is 32. With Firefox 32, Selenium starts Firefox but then hangs.

If your application runs on a normal server that is "headless", that is, it has no graphical user interface, read the section "Making Firefox headless".

Easy installation of relenium requires the R devtools package. On Linux, it has a non-trivial dependency on the development libraries of curl. In Debian the package is libcurl4-openssl-dev. In Fedora it is libcurl-devel.

Another dependency is a fully installed proprietary Java SDK, which is painful on most Linux distributions. In the older Debian squeeze, Java 6 is available in non-free, but in newer Debian releases it has been removed due to license restrictions. And even in Debian squeeze, remember to manually run:

sudo update-java-alternatives -s java-6-sun

after installing the package. In other Linux distributions you must update every alternative separately (jar, javac, javah, …) after installing the JDK from the Oracle site. The alternatives must be set correctly to get rJava installed.

The Java environment is configured for R using the command:

sudo R CMD javareconf

After the Linux dependencies are done, relenium is installed like this:

install.packages("rJava")
install.packages("devtools")
library(devtools)
# "user/repo" form; the separate username argument is deprecated in newer devtools
install_github("LluisRamon/seleniumJars")
install_github("LluisRamon/relenium")

Selenium is packaged with relenium (‘seleniumJars’), so it does not need separate installation.
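After the installation it can be useful to confirm that the whole R side of the stack is present. The following is a minimal sketch that only checks whether the packages named above are installed. It does not verify that Java or Firefox work, and the variable names are my own.

```r
# Packages named in the installation steps above.
needed <- c("rJava", "devtools", "seleniumJars", "relenium")

# installed.packages() lists every package found in the library paths.
installed <- rownames(installed.packages())

# Anything still missing; character(0) means all four packages are installed.
missing_pkgs <- setdiff(needed, installed)
print(missing_pkgs)
```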

Using Relenium

We can start Firefox, ask it to fetch a page, and retrieve the HTML after Javascript evaluation using the commands

require(relenium)
firefox = firefoxClass$new()
firefox$get("http://www.epexspot.com/en/market-data")
html = firefox$getPageSource()
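If the Javascript on a page is slow, the source may be requested before the table has been written. One possible workaround is to poll until the expected text shows up. The helper below, wait_for_source, is hypothetical and not part of relenium. It takes any function that returns the page source, so it only assumes that firefox$getPageSource() returns text, as in the commands above.

```r
# Hypothetical helper: call get_source() repeatedly until its result contains
# `pattern` or `timeout` seconds have passed. Returns the last source seen.
wait_for_source <- function(get_source, pattern, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    src <- get_source()
    found <- any(grepl(pattern, src, fixed = TRUE))
    if (found || Sys.time() >= deadline) {
      return(src)
    }
    Sys.sleep(interval)
  }
}

# With the browser started as above (not run here):
# html <- wait_for_source(function() firefox$getPageSource(), "<table")
```

Note that on timeout the function still returns whatever source it saw last, so the caller should check that the table is really there.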

And the result can be parsed as before:

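require(XML)
t = readHTMLTable(html)              # all tables on the rendered page
DE = t[[6]][,9]                      # sixth table, ninth column
DE.date = as.character(DE[1])        # first entry is the date
DE = DE[seq(2, length(DE), 2)]       # keep every second entry, starting from the second
DE = as.numeric(as.character(DE))    # text (or factor) -> number
DE = DE / 10.0                       # EUR/MWh -> c/kWh

The code produces a vector of the 24 hourly electricity prices for the latest day in Germany (in c/kWh). For example, the result today is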

> DE
[1] 2.834 2.924 2.846 2.695 2.579 2.717 4.324 5.278 5.217 4.989 4.545 3.907
[13] 3.107 2.833 2.707 2.934 3.163 4.006 4.360 4.591 4.093 2.978 2.551 1.383
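To see what the conversion steps do without fetching the page, here is a small sketch on a made-up column. The date and the "n/a" fillers are invented for illustration, and the three prices are simply the first three values of the output above multiplied by 10. It also shows why as.character() is applied first: calling as.numeric() directly on a factor returns the internal level codes, not the numbers.

```r
# Made-up stand-in for t[[6]][, 9]: a factor whose first entry is a date and
# whose even positions hold prices in EUR/MWh. The "n/a" fillers are invented;
# what the real page has in those positions is not shown in this post.
col <- factor(c("04/10/2014", "28.34", "n/a", "29.24", "n/a", "28.46", "n/a"))

price_date <- as.character(col[1])            # first entry: the date
prices <- col[seq(2, length(col), 2)]         # every second entry from the 2nd
prices <- as.numeric(as.character(prices))    # factor -> text -> number
prices <- prices / 10                         # EUR/MWh -> c/kWh

print(price_date)
print(prices)
```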

Making Firefox headless

If you are running the code in a normal server environment with no graphical user interface, Firefox cannot find a display to start on. It is possible to define a virtual display that satisfies Firefox.

The virtual X server is Xvfb.

In Debian:

sudo apt-get install xvfb

And to start the server:

Xvfb :10 -ac &

And to start the headless Firefox:

export DISPLAY=:10
firefox

It might be tricky to pass the DISPLAY environment variable to the Firefox started by relenium, but luckily, if DISPLAY is set before starting R, it is passed on to the Firefox started by relenium (shell->R->Java->Firefox).
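Since a missing DISPLAY tends to show up only as a browser that never starts, it may help to check the variable from R before launching Firefox. The helper below, require_display, is a hypothetical sketch. It only reads the environment that R inherited from the shell and stops with a message if DISPLAY is empty.

```r
# Hypothetical helper: stop early if DISPLAY is not set, otherwise return it.
require_display <- function() {
  display <- Sys.getenv("DISPLAY", unset = "")
  if (!nzchar(display)) {
    stop("DISPLAY is not set; start Xvfb and export DISPLAY before starting R")
  }
  display
}

# Call require_display() before firefoxClass$new() in a headless script.
```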