Greg Foletta - Bits and Blobs

A melange of networking, data science and programming.

Byte-size: An Ode to Web Scraping with R

Last week I needed to pull some data from a website. I was building out a data pipeline to generate the end-of-season awards for Little Athletics, but I ran into a problem. Like a dependable friend, R came to the rescue with a simple, elegant solution. This post is a ‘byte-size’ ode to that friend.

Beauty and Terseness

I’ll get to the challenge I ran into shortly, but first we’ll take a look at what basic web scraping looks like. Suppose you want to get all the headlines from The Age’s website. You look at the source and see that the article links are <a> tags with a data-testid attribute equal to article-link. Here’s the pipeline that achieves this:

library(httr2)   # web requests
library(rvest)   # HTML parsing
library(dplyr)   # filter(), slice_head()
library(tibble)
library(gt)      # presentation tables

# Fetch the front page, extract the article links, and show the
# first ten non-empty headlines.
request('http://theage.com.au') |>
    req_perform() |>
    resp_body_html() |>
    html_elements(xpath = "//a[@data-testid='article-link']") |>
    html_text() |>
    tibble(.name_repair = ~c('headline')) |>
    filter(headline != "") |>
    slice_head(n = 10) |>
    gt()
headline
Target Time
Get 2-for-1 Comedy Festival tickets*
The Morning Edition podcast
The 100 most expensive Melbourne public schools revealed
$4b on a new station in Melbourne’s west – has Victoria lost its budgetary mind?
Daily atrocities spew out of the Oval Office. It’s still not enough to lure Australian voters back to the known
Trump administration upends US foreign policy, holds secret talks with Hamas
Trump’s speech was full of wild claims. Here are seven that weren’t true
US lands new blow on Ukraine after Oval Office stoush
The number one reason why people are leaving Melbourne

There we go: five lines of R and you’ve got the headlines, plus a couple more to get them into a nicer structure. What makes it so simple? I think it comes down to two things. The first is R’s pipe operator, which means you don’t have to pepper your code with temporary variables. The second is R’s vectorisation, which means you don’t need to worry about any loops. I also tip my hat to the relatively new httr2 package, which makes web requests fit much better into a pipeline.
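
To make those two points concrete, here’s a contrived sketch (the data is made up, not from The Age): the piped version reads top-to-bottom with no intermediate variables, and trimws() and toupper() run across the whole vector with no loop in sight.

headlines <- c("  First headline ", "", " Second headline ")

# Without the pipe: nested calls, read inside-out.
toupper(trimws(headlines[headlines != ""]))

# With the pipe: the same steps, read top-to-bottom.
headlines[headlines != ""] |>
    trimws() |>
    toupper()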

The Challenge

The challenge I ran into last week was that, while the data I needed was structured, it wasn’t HTML, XML, or even JSON: it was JavaScript. Here’s an abridged sample of what was returned by one of the API calls:

sessions_NMRKeilor = [
  {"SessNbr":"1","SessPtr":"24","SessName":"Sat Morning - Field","SessDay":"1","SessTime":"30600"},
  {"SessNbr":"2","SessPtr":"25","SessName":"Sat Morning - Track","SessDay":"1","SessTime":"32400"},
  {"SessNbr":"3","SessPtr":"46","SessName":"Sat Afternoon - Field","SessDay":"1","SessTime":"46800"},
  ...
]

Not yet sure what I’d do with it, I fetch the data:

js_content <-
    request('https://lavic.resultshub.com.au/php/resultsFileFetch.php?season=2024&series=regions&round=3&venue=undefined') |>
    req_perform() |>
    resp_body_string()
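
A quick peek at the start of the string (a side check the pipeline doesn’t depend on) should show the variable assignment from the sample above:

# Print the first couple of hundred characters of the response body.
cat(substr(js_content, 1, 200))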

The mind initially goes to dark places: can I solve this with a regex? Maybe strip out the variable assignment and parse what’s left as JSON? That route would look something like the sketch below (hypothetical, and fragile: it assumes the whole payload is a single variable = [...] statement):
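
# Brittle sketch: drop the `variable = ` prefix and any trailing
# semicolon, then hope what's left is valid JSON.
sessions_json <- sub("^\\s*\\w+\\s*=\\s*", "", js_content)
sessions_json <- sub(";\\s*$", "", sessions_json)
jsonlite::fromJSON(sessions_json)

Pulling myself together, I think “this is just JavaScript, is there a way I can simply evaluate it?”. Some research shows that the V8 package provides an R API into Google’s V8 JavaScript engine. This should allow me to evaluate the JavaScript code I received: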

library(V8)
# Create a JavaScript context and evaluate the fetched source in it.
jscontext <- v8()
jscontext$eval(js_content)
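
If you’re feeling cautious, a quick check that the evaluation actually defined the variable (optional, but cheap):

# Should return "object" once the assignment in js_content has run.
jscontext$eval('typeof sessions_NMRKeilor')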

From there I can pull out the variable I need, and the library converts it, via JSON, into a clean R data frame:

jscontext$get('sessions_NMRKeilor') |>
    gt()
SessNbr  SessPtr  SessName               SessDay  SessTime
1        24       Sat Morning - Field    1        30600
2        25       Sat Morning - Track    1        32400
3        46       Sat Afternoon - Field  1        46800
4        30       Sat Afternoon - Track  1        46800
5        43       Sun Morning - Field    2        30600
6        38       Sun Morning - Track    2        30600
7        47       Sun Afternoon - Field  2        46800
8        45       Sun Afternoon - Track  2        47700

That’s it: web request to JavaScript evaluation to structured R data in a few lines; a thing of beauty.