PhantomJS – site scraping and page capture

According to the official site, PhantomJS is “a headless WebKit with JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.”
It can be used, for example, to capture pages that run a lot of JavaScript, then grab the rendered source and parse data out of it. All of that happens on the command line, with no output to the screen.

Let's set up PhantomJS on a Debian-based system.

1. Compile PhantomJS from source.

sudo apt-get install libqt4-dev qt4-qmake
wget http://phantomjs.googlecode.com/files/phantomjs-1.4.1-source.tar.gz
tar zxvf phantomjs-1.4.1-source.tar.gz
cd phantomjs-1.4.1
qmake-qt4 && make
sudo cp bin/phantomjs /usr/bin/phantomjs

2. Prepare the X virtual framebuffer (Xvfb) for page rendering.
Xvfb is an X11 server that performs all graphical operations in memory and shows no screen output. PhantomJS 1.4 on Linux still needs an X server to render pages, and Xvfb provides one on a machine without a display.

sudo apt-get install xvfb xserver-xorg

3. Create a start script for Xvfb. Save the following as /etc/init.d/xvfb and make it executable (sudo chmod +x /etc/init.d/xvfb).

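#! /bin/sh

### BEGIN INIT INFO
# Provides: Xvfb
# Required-Start: $local_fs $remote_fs
# Required-Stop:
# X-Start-Before:
# Default-Start: 2 3 4 5
# Default-Stop:
### END INIT INFO

N=/etc/init.d/xvfb

set -e

case "$1" in
  start)
    # Run Xvfb in the background on display :0 with a 1024x768, 24-bit screen
    Xvfb :0 -screen 0 1024x768x24 &
    ;;
  stop|reload|restart|force-reload)
    # Nothing to do: this minimal script only starts Xvfb
    ;;
  *)
    echo "Usage: $N {start|stop|restart|force-reload}" >&2
    exit 1
    ;;
esac

exit 0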
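
The script above hard-codes display :0, which may already be taken if a real X server runs on the same machine. As a rough sketch, the helper below checks for the lock file and socket that X servers normally create under /tmp. The function name display_status and the X_TMP_DIR override are made up for this example, not part of Xvfb.

```shell
#!/bin/sh
# Report whether an X display number looks taken.
# X servers normally create /tmp/.X<n>-lock and /tmp/.X11-unix/X<n>.
# X_TMP_DIR is a hypothetical override, mainly useful for testing.
display_status() {
    n="$1"
    dir="${X_TMP_DIR:-/tmp}"
    if [ -e "$dir/.X${n}-lock" ] || [ -S "$dir/.X11-unix/X${n}" ]; then
        echo "in use"
    else
        echo "free"
    fi
}

# Check the display used by the init script
display_status 0
```

If display 0 is reported as in use, pick another number (for example :99) in both the init script and the DISPLAY variable.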

4. Start Xvfb and test PhantomJS.

/etc/init.d/xvfb start
DISPLAY=:0 phantomjs --version

If this prints the PhantomJS version number, everything is OK! Now we can do almost anything with web pages. The official documentation is a good place to start:
API Reference
Quick Start
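
As a first experiment, here is a minimal sketch of the capture-and-scrape idea from the introduction. A shell function writes a small PhantomJS script that loads a page, saves a screenshot and prints the HTML after JavaScript has run. The function name write_capture_script, the file names capture.js and capture.png, and the example URL are placeholders. The JavaScript uses the webpage module as I understand it from the 1.x API, so check it against the API Reference for your version.

```shell
#!/bin/sh
# Generate a PhantomJS script for one URL.
# Assumes the URL contains no single quotes, since it is pasted into JavaScript.
write_capture_script() {
    url="$1"        # page to load
    script="$2"     # where to write the generated JavaScript
    cat > "$script" <<EOF
var page = require('webpage').create();
page.open('$url', function (status) {
    if (status !== 'success') {
        console.log('Failed to load $url');
        phantom.exit(1);
    } else {
        page.render('capture.png');   // screenshot of the rendered page
        console.log(page.content);    // HTML after JavaScript has run
        phantom.exit(0);
    }
});
EOF
}

write_capture_script "http://example.com/" capture.js
# Then, with Xvfb running on display :0:
#   DISPLAY=:0 phantomjs capture.js > page.html
```

The saved page.html can then be parsed with whatever tool you prefer, and capture.png holds the rendered page image.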

