More Web Scraping with PhantomJS & CoffeeScript // galvanist

I’ve been playing with PhantomJS and CoffeeScript a bit more. In this example I load the front page of Google News, extract the headline links and their link text, and write them to stdout in Markdown format.

# headlines: generate markdown-formatted output with just the headlines
# and links from the main google news page (no sections, nothing fancy)
page_url = 'http://news.google.com/'
jquery_url = 'http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js'
page = require("webpage").create()
	
# this is a handler that relays console messages from inside our page to
# the phantomjs console. I've made it ignore everything that isn't
# prefixed with the word 'headline:'
page.onConsoleMessage = (msg) ->
    if msg.substring(0, 9) is "headline:"
        console.log(" - " + msg.substring(9))
	
	
page.open page_url, (status) ->
    if status isnt "success"
        console.log "# Error: couldn't load the page"
        phantom.exit(1)
        return  # phantom.exit() may not stop this callback right away, so bail out explicitly
    console.log "# The News\n"
    page.includeJs jquery_url, ->
        page.evaluate ->
            $('a.article').has('span.titletext').each (indx, elem) ->
                title = $(elem).text()
                href = $(elem).attr 'href'
                console.log "headline:[#{title}](#{href})"
        phantom.exit()  # I need this here to let the evaluate finish
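The relay handler above is really a tiny protocol: the page tags the lines it wants passed along, and the handler drops everything else. Here is a minimal sketch of just that filtering step in plain JavaScript, so it can be run outside PhantomJS. The names PREFIX and relayLine are made up for illustration; only the 'headline:' tag comes from the script. Deriving the length from the prefix instead of hard-coding 9 is one way to keep the two from drifting apart.

```javascript
// Hypothetical names for illustration; only the "headline:" tag is from the script.
var PREFIX = "headline:";

// Return the markdown list item for a tagged console message,
// or null for anything else the page happens to log.
function relayLine(msg) {
  if (msg.substring(0, PREFIX.length) !== PREFIX) {
    return null;
  }
  return " - " + msg.substring(PREFIX.length);
}

console.log(relayLine("headline:[Title](http://example.com/)"));
console.log(relayLine("some unrelated page warning"));
```

The first call yields a list item and the second yields null, which the real handler would simply skip.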

Getting to know CoffeeScript

The combination of CoffeeScript and jQuery feels natural. But the overall program feels pretty brittle, and the error messages PhantomJS printed while I was writing this little script weren’t very helpful. To be fair, part of that is down to CoffeeScript: I’m still getting used to it, and I’m in the phase where an overlooked unbalanced paren or an unintentionally ambiguous expression costs me a lot of debugging time.

For example, this looked OK to me at first:

if msg.substring 0, 9 is "headline:"

but of course CoffeeScript treats everything after the function name as the argument list, so that compiles to

if (msg.substring(0, 9 === "headline:")) {

so instead I had to use the parens to disambiguate

if msg.substring(0, 9) is "headline:"

which makes me wonder whether it’s simply a good idea to always use parens in CoffeeScript, but that’s less beautiful…
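Part of what makes this mistake expensive is that nothing fails loudly. A small sketch in plain JavaScript of what the two compiled forms evaluate to, using the 'headline:' prefix from the script and a made-up sample message:

```javascript
// Sample message is made up; the prefix matches the script above.
var msg = "headline:[Title](http://example.com/)";

// Paren-less version as compiled: the comparison becomes the second argument.
// 9 === "headline:" is false, and substring() coerces false to 0.
var ambiguous = msg.substring(0, 9 === "headline:");

// Version with explicit parens: compare the first nine characters.
var explicit = msg.substring(0, 9) === "headline:";

console.log(JSON.stringify(ambiguous)); // ""
console.log(explicit);                  // true
```

The ambiguous form produces an empty string, which is falsy, so the if body never runs and no headlines are printed. There is no exception to point at the offending line.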

Scrape The News

The script runs like this:

$ phantomjs headlines.coffee >headlines.md
$ head headlines.md
# The News
	
 - [Ukraine to seek UN support to demilitarise Crimea](http://www.irishtimes.com/news/world/europe/ukraine-to-seek-un-support-to-demilitarise-crimea-1.1730599)
 - [US President Barack Obama rules out 'military excursion' in Ukraine](http://timesofindia.indiatimes.com/world/europe/US-President-Barack-Obama-rules-out-military-excursion-in-Ukraine/articleshow/32332949.cms)
 - [Ukraine plans to pull troops from Crimea](http://theadvocate.com/home/8678488-125/ukraine-plans-to-pull-troops)
 - [For Moscow, Crimea may prove an expensive prize (+video)](http://www.csmonitor.com/World/Europe/2014/0319/For-Moscow-Crimea-may-prove-an-expensive-prize)
 - [FACT SHEET: Ukraine-Related Sanctions](http://www.whitehouse.gov/the-press-office/2014/03/17/fact-sheet-ukraine-related-sanctions)
 - [2014 Russian military intervention in Ukraine](http://en.wikipedia.org/wiki/2014_Russian_military_intervention_in_Ukraine)
 - [India Morning Call-Global Markets](http://www.reuters.com/article/2014/03/20/morningcall-india-idUSL3N0MH0MZ20140320)
 - [Fed ponders when to raise rates: what Yellen has on her dashboard](http://www.csmonitor.com/Business/2014/0319/Fed-ponders-when-to-raise-rates-what-Yellen-has-on-her-dashboard)

I don’t actually have much use for scraping Google News, but this bit of code shows me that my last tie to Perl, the extremely powerful WWW::Mechanize module, can be cut.