Jun 6, 2012

On quite a few occasions over the years I've had to resort to home-built scrapers to obtain information that was only available in messy web pages. Typically the approach has been some nasty string hacks or regular expressions.

Ever since Firebug and jQuery came along, I've wished I could use my routine web debugging techniques for scraping.

Then came cleverer solutions

When I moved to Pythonland a few years back, my scraping skills took a big step up once I found out about BeautifulSoup. I was finally able to query the DOM, to some extent, and programmatically extract information. The soup is tasty, well made, and has served its purpose beautifully. But still, it didn't sit quite right.

Every time I needed a scraper, I spent an hour googling how I could use jQuery for the job, and last year I found PhantomJS. As stated on its web page, it is a headless WebKit with a JavaScript API, which meant I could inject jQuery into pages and go wild with selectors. My beef with PhantomJS is that the scraping code runs inside the browser environment, which means no easy access to npm or any of its modules.
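
For reference, a PhantomJS script along those lines looks roughly like this. It's only a sketch: the jQuery CDN URL and the td.title selector are assumptions of mine, not something lifted from a real scraper.

var page = require('webpage').create();

page.open('http://news.ycombinator.com/', function (status) {
  if (status !== 'success') {
    console.log('failed to load the page');
    phantom.exit(1);
  }

  // Pull jQuery into the loaded page, then run selectors inside it.
  page.includeJs('http://code.jquery.com/jquery-1.7.2.min.js', function () {
    var titles = page.evaluate(function () {
      // This runs in the page context; only plain data makes it back out.
      return $('td.title a').map(function () { return $(this).text(); }).get();
    });
    console.log(JSON.stringify(titles));
    phantom.exit();
  });
});

Everything here lives inside PhantomJS's own world, which is exactly the problem: there's no pulling in npm modules from in there.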

Then there was server-side JavaScript

In the same googling session, I came across jsdom, which is a pure JavaScript implementation of the DOM, adhering to the W3C specifications. There is even an example of loading jQuery into pages, all of it on the server side. All of this juiciness runs on Node.js, with the full arsenal of npm at my fingertips.
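
The basic pattern looks roughly like this (a minimal sketch; the URL, the jQuery version and the selector are placeholders of mine, not the project's own example):

var jsdom = require('jsdom');

// Fetch a page, inject jQuery from a CDN, and get a window handle back
// once the document and the injected script are ready.
jsdom.env('http://nodejs.org/dist/', [
  'http://code.jquery.com/jquery-1.7.2.min.js'
], function (errors, window) {
  if (errors) { return console.error(errors); }
  console.log('links on the page:', window.$('a').length);
});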

After hours of playing with it, I was extremely satisfied: it was what I had been looking for. My previous blog post is built on top of data gathered this way, the fruit of my scraping adventures.

As others have pointed out, this approach isn't perfect or without flaws, but for the purposes I've thrown its way, it gets the job done.

An example

To show how simple and powerful this way of scraping is, I've cooked up a small scraper that collects all the stories from the front page of Hacker News.

The code for this scraper is quite straightforward; a sketch of it follows the list of steps.

  • Requests the front page.
  • Creates a jsdom environment.
  • Injects jQuery into it.
  • Waits for the DOM ready event.
  • Extracts the node containing all the stories.
  • Extracts the stories one at a time.
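
Here's a sketch of how those steps fit together with jsdom and jQuery. The selectors (td.title, td.subtext) are my guess at how the Hacker News front page is marked up, and the jQuery CDN URL is a placeholder, so treat those as the parts most likely to need adjusting.

var jsdom = require('jsdom');

// Request the front page, build a jsdom environment, inject jQuery and
// wait until both the DOM and the injected script are ready.
jsdom.env('http://news.ycombinator.com/', [
  'http://code.jquery.com/jquery-1.7.2.min.js'
], function (errors, window) {
  if (errors) { return console.error(errors); }

  var $ = window.$;
  var stories = [];

  // Each story title lives in a td.title cell; the row right below it
  // has a td.subtext cell with the points, the submitter and the
  // comment count.
  $('td.title > a').each(function () {
    var link = $(this);
    var subtext = link.closest('tr').next().find('td.subtext');

    stories.push({
      title: link.text(),
      url: link.attr('href'),
      source: link.parent().find('.comhead').text().replace(/[()\s]/g, ''),
      user: subtext.find('a:first').text(),
      points: parseInt(subtext.find('span').text(), 10) || 0,
      comments: parseInt(subtext.find('a:last').text(), 10) || 0
    });
  });

  console.log(stories);
});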

The output of the scraper is a simple JavaScript array containing objects like these:

[{ title: "Bookmarklet to see YC / Reddit thread of any URL",
   url: "http://see-reaction.appspot.com/index.html",
   source: "see-reaction.appspot.com",
   user: "theone",
   points: 36,
   comments: 8 }, ...]
