
pjscrape by Nick Rabinowitz

A web-scraping framework written in Javascript, using PhantomJS and jQuery

Overview

pjscrape is a framework for anyone who's ever wanted a command-line tool for web scraping using Javascript and jQuery. Built to run with PhantomJS, it allows you to scrape pages in a fully rendered, Javascript-enabled context from the command line, no browser required.

Features

  • Client-side, Javascript-based scraping environment with full access to jQuery functions
  • Easy, flexible syntax for setting up one or more scrapers
  • Recursive/crawl scraping
  • Delay scrape until a "ready" condition occurs
  • Load your own scripts on the page before scraping
  • Modular architecture for logging and writing/formatting scraped items
  • Client-side utilities for common tasks
  • Growing set of unit tests

In its most concise syntax, pjscrape makes scraping a webpage as easy as this:

pjs.addSuite({
    // url to scrape
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    // selector to look for
    scraper: '#sortable_table_id_0 tr td:nth-child(2)'
});
// Output: ["Addison","Albany","Alburgh", ...]

And crawling a set of webpages as easy as this:

pjs.addSuite({
    // url to start at
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    // selector to find more urls to spider
    moreUrls: '#sortable_table_id_0 tr td:nth-child(2) a',
    maxDepth: 1,
    // function to get some data
    scraper: function() {
        return { 
            name: $('#firstHeading').text(),
            elevation: $('td:contains("Elevation") + td').text()
        }
    }
});
// Output: [{"name":"Addison, Vermont","elevation":"89 ft (27 m)"}, ...]

Ok, that's 14 lines with comments. But it's still a pretty simple API, right?

Quick Start

  1. Download and install PhantomJS or PyPhantomJS, v.1.2. To use file-based logging or data writes, you'll need to use PyPhantomJS with the Save to File plugin (though I think this feature will be rolled into the PhantomJS core in the next version).

  2. Make a config file (e.g. my_config.js) to define your scraper(s). Config files can set global pjscrape settings via pjs.config() and add one or more scraper suites via pjs.addSuite().

  3. A scraper suite defines a set of scraper functions for one or more URLs. A simple config file might look like this:

    pjs.addSuite({
        // single URL or array
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        // single function or array, evaluated in the client
        scraper: function() {
            return $('h1#firstHeading').text();
        }
    });
    

    A scraper this simple can also be added with the pjs.addScraper(url, scraper) convenience function.
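
    Using the same URL and scraper function as above, the convenience form would look like this:

    ```javascript
    // equivalent to the suite above, using the convenience function
    pjs.addScraper(
        'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        function() {
            return $('h1#firstHeading').text();
        }
    );
    ```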

  4. To run pjscrape from the command line, type:

    ~> pyphantomjs /path/to/pjscrape.js my_config.js
    * Suite 0 starting
    * Opening http://en.wikipedia.org/wiki/List_of_towns_in_Vermont
    * Scraping http://en.wikipedia.org/wiki/List_of_towns_in_Vermont
    * Suite 0 complete
    * Writing 1 items
    ["List of towns in Vermont"]
    * Saved 1 items
    

    By default, the log output is pretty verbose, and the scraped data is written as JSON to stdout at the end of the scrape.

  5. You can configure logging, formatting, and writing data using pjs.config():

    pjs.config({ 
        // options: 'stdout', 'file' (set in config.logFile) or 'none'
        log: 'stdout',
        // options: 'json' or 'csv'
        format: 'json',
        // options: 'stdout' or 'file' (set in config.outFile)
        writer: 'file',
        outFile: 'scrape_output.json'
    });
    

Tutorial

Writing Scrapers

The core of a pjscrape script is the definition of one or more scraper functions. Here's what you need to know:

  • Scraper functions are evaluated in a full browser context. This means you not only have access to the DOM, you have access to Javascript variables and functions, AJAX-loaded content, etc.

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return wgPageName; // variable set by Wikipedia
        }
    });
    // Output: ["List_of_towns_in_Vermont"]
    
  • Scraper functions are evaluated in a sandbox, separate from the main PhantomJS execution context. Closures will not work the way you might expect:

    var myPrivateVariable = "test";
    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return myPrivateVariable;
        }
    });
    // CLIENT: ReferenceError: Can't find variable: myPrivateVariable
    

    The best way to think about your scraper functions is to assume the code is being eval()'d in the context of the page you're trying to scrape.

  • Scrapers have access to a set of helper functions in the _pjs namespace. See the Javascript API docs for more info. One particularly useful function is _pjs.getText(), which returns an array of text from the matched elements:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return _pjs.getText('#sortable_table_id_0 tr td:nth-child(2)');
        }
    });
    // Output: ["Addison","Albany","Alburgh", ...]
    

    For this common case, there's actually a shorter syntax: if your scraper is a string instead of a function, pjscrape will assume it is a selector and use it in a function like the one above:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: '#sortable_table_id_0 tr td:nth-child(2)'
    });
    // Output: ["Addison","Albany","Alburgh", ...]
    
  • Scrapers can return data in whatever format you want, provided it's JSON-serializable (so you can't return a jQuery object, for example). The following code returns the list of towns in the Django fixture syntax:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return $('#sortable_table_id_0 tr').slice(1).map(function() {
                var name = $('td:nth-child(2)', this).text(),
                    county = $('td:nth-child(3)', this).text(),
                    // convert relative URLs to absolute
                    link = _pjs.toFullUrl(
                        $('td:nth-child(2) a', this).attr('href')
                    );
                return {
                    model: "myapp.town",
                    fields: {
                        name: name,
                        county: county,
                        link: link
                    }
                }
            }).toArray(); // don't forget .toArray() if you're using .map()
        }
    });
    /* Output: 
    [{"fields":{"link":"http://en.wikipedia.org/wiki/Addison,_Vermont",
    "county":"Addison","name":"Addison"},"model":"myapp.town"}, ...]
    */
    
  • Scraper functions can always access the version of jQuery bundled with pjscrape (currently v.1.6.1). If you're scraping a site that also uses jQuery, and you want the latest features, you can set noConflict: true and use the _pjs.$ variable:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        noConflict: true,
        scraper: function() {
            return [
                window.$().jquery, // the version Wikipedia is using
                _pjs.$().jquery // the version pjscrape is using
            ];
        }
    });
    // Output: ["1.4.2","1.6.1"]
    

Asynchronous Scraping

Docs coming soon.
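
The feature list above mentions delaying the scrape until a "ready" condition occurs. As a rough sketch, assuming a `ready` option that is evaluated in the client and polled until it returns true (check the API docs for the exact option name), waiting for AJAX-loaded content might look like this:

```javascript
pjs.addSuite({
    // hypothetical URL for a page that loads its content via AJAX
    url: 'http://example.com/ajax-heavy-page',
    // don't scrape until the results list has been populated
    ready: function() {
        return $('#results li').length > 0;
    },
    scraper: function() {
        return _pjs.getText('#results li');
    }
});
```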

Crawling Multiple Pages

Docs coming soon - the main thing is to set the moreUrls option to either a function or a selector that identifies additional URLs to scrape.
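
As a sketch of the function form, a moreUrls function is evaluated in the client and should return an array of URLs to spider next. The URL and selectors here are hypothetical:

```javascript
pjs.addSuite({
    // hypothetical paginated listing to start at
    url: 'http://example.com/listing?page=1',
    // function form: return an array of URLs to spider next
    moreUrls: function() {
        return $('a.next-page').map(function() {
            // convert relative hrefs to absolute URLs
            return _pjs.toFullUrl($(this).attr('href'));
        }).toArray();
    },
    maxDepth: 2,
    // string form: scrape the matched elements' text on each page
    scraper: '.item-title'
});
```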

Bookmarklet

pjscrape includes a bookmarklet for loading jQuery and the pjscrape client code into the current browser context. You can use this to test scrapers in the browser: once you've run the bookmarklet, you can run pjs.addSuite() in your console window.

To get the bookmarklet, just drag the following link to your bookmarks bar:

Load Pjscrape

pjscrape is (c) 2011 by Nick Rabinowitz. Comments welcomed at nick (at) nickrabinowitz (dot) com.