I need to scrape some information from https://hasjob.co/. I can scrape part of it by getting through the login page and scraping as usual, but most of the information is generated by JavaScript only when you scroll down to the bottom of the page.

Is there any solution using Python?

import mechanize
import cookielib
from bs4 import BeautifulSoup

import pprint

job = []

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.addheaders = [('User-agent', 'Chrome')]

# The site we will navigate into, handling its session
br.open('https://auth.hasgeek.com/login')

# View available forms
##for f in br.forms():
##    print f

# Select the second (index one) form (the first form is a search query box)
br.select_form(nr=1)

# User credentials
br.form['username'] = 'username'
br.form['password'] = 'pass'

br.submit()

##print(br.open('https://hasjob.co/').read())

r = br.open('https://hasjob.co/')

# Parse the returned HTML with an explicit parser
soup = BeautifulSoup(r.read(), 'html.parser')

# Collect the text of each job annotation on the page
for tag in soup.find_all('span', attrs={'class': 'annotation bottom-right'}):
    job.append(tag.text)


pp = pprint.PrettyPrinter(depth=6)

pp.pprint(job)
Tom chan

2 Answers


For some reason almost no one notices that Hasjob has an Atom feed, and it's linked from the home page. Reading structured data from Hasjob with the feedparser library is as simple as:

import feedparser
feed = feedparser.parse('https://hasjob.co/feed')
for job in feed.entries:
    print job.title, job.link, job.published, job.content

The feed used to cover a full 30 days, but that's now over 800 entries and a fair bit of load on the server, so I've cut it down to the last 24 hours of jobs. If you want a regular helping of jobs, just load from this URL at least once a day.
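As a minimal sketch of such a daily poll, you could use feedparser's conditional-GET support so an unchanged feed isn't re-downloaded (the etag and modified arguments are part of feedparser's API; the endless loop and once-a-day interval are just illustration):

import time

import feedparser

etag = modified = None
while True:
    feed = feedparser.parse('https://hasjob.co/feed', etag=etag, modified=modified)
    if getattr(feed, 'status', None) != 304:  # 304 means nothing new since last poll
        for job in feed.entries:
            print job.title, job.link
        etag = feed.get('etag')
        modified = feed.get('modified')
    time.sleep(24 * 60 * 60)  # poll once a day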

Kiran Jonnalagadda

You could take a look at the Python module PyV8, a Python wrapper for the Google V8 JavaScript engine.
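For reference, a minimal PyV8 session looks something like this; it's only a sketch of the basic JSContext API, and wiring it up to the scripts from a scraped page is left out:

import PyV8

# Evaluate a JavaScript expression from Python inside a V8 context.
with PyV8.JSContext() as ctxt:
    print ctxt.eval('1 + 2')  # -> 3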

You could also try using GhostDriver via Selenium; see the example here: Selenium with GhostDriver in Python on Windows. With Selenium you have the option of running a visible browser instance in either Firefox or Chrome (via chromedriver) while you're getting things to work, and then switching to PhantomJS (a windowless browser) once your scraper is working. Note, though, that spinning up a full browser instance is probably complete overkill, although it really depends on what you're doing. If you're not running it too frequently I guess it's fine, but normally Selenium is used for browser testing rather than for scraping.
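For the infinite-scroll problem in the question, a rough Selenium sketch might look like the following. The scroll count and pauses are assumptions you'd need to tune for the page, and the parsing step is omitted:

import time

from selenium import webdriver

# PhantomJS is windowless; swap in webdriver.Firefox() to watch it work.
driver = webdriver.PhantomJS()
driver.get('https://hasjob.co/')

# Scroll to the bottom repeatedly so the page's JavaScript loads more jobs.
# Five rounds with a two-second pause are guesses; tune them as needed.
for _ in range(5):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the AJAX requests time to finish

html = driver.page_source  # now includes the lazily loaded entries
driver.quit()

From there you can hand html to BeautifulSoup exactly as in the question's code.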

Loknar