
I am working on a project in which I need to crawl several websites and gather different kinds of information from them: text, links, images, etc.

I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I get stuck when parsing sites that contain a lot of JavaScript, since most of the information on those pages is stored inside <script> tags.

Any ideas how to do this?

user1934948
  • and another resource: http://stackoverflow.com/questions/22624255/how-to-scrape-search-results-if-returned-in-javascript-using-python/22630026#22630026 – Ehvince Mar 31 '14 at 15:31
  • as a side note, selenium is much more lightweight than Ghost. – Ehvince Mar 31 '14 at 15:32

4 Answers


First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will parse everything for you just like a regular browser would.
The only difference is that its main interface is not a GUI/HMI but an API.

For example, you can use PhantomJS, or Chrome and Firefox, both of which support a headless mode.

For a more complete list of headless browsers check here.
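
For instance, here is a minimal sketch of driving headless Chrome through selenium (the URL is just a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("http://example.com")  # placeholder URL
html = driver.page_source  # rendered HTML, with the scripts already executed
driver.quit()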

bosnjak
  • I am able to get Ghost to work and load the page, but what should I do to get the whole webpage out of it? The documentation describes a function get_page, but it is not there, even in the code itself. – user1934948 Apr 23 '14 at 15:10

If a lot of dynamic JavaScript loading is involved in rendering the page, things get more complicated.

Basically, you have three ways to get the data from the website:

  • use the browser developer tools to see which AJAX requests go out on page load, then simulate these requests in your crawler. You will probably need the help of the json and requests modules (a sketch follows this list).
  • use tools that drive real browsers, like selenium. In this case you don't care how the page is loaded - you'll get what a real user sees. Note: you can use a headless browser too.
  • see if the website provides an API (e.g. the walmart API)
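
As a sketch of the first approach, here is how you might replay an AJAX request found in the network tab; the endpoint and parameters below are hypothetical and would come from your own inspection:

import requests

# hypothetical JSON endpoint discovered via the browser developer tools
response = requests.get(
    "http://example.com/api/items",
    params={"q": "laptops", "page": 1},
)
data = response.json()  # the server returns JSON, so no HTML parsing is needed
for item in data.get("items", []):
    print(item.get("title"))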

Also take a look at the Scrapy web-scraping framework - it doesn't handle AJAX calls either, but it is the best tool in the web-scraping world I've ever worked with.
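
To give an idea of what it looks like, here is a minimal sketch of a Scrapy spider; the start URL and CSS selector are placeholders for whatever your target site uses:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]  # placeholder start page

    def parse(self, response):
        # placeholder selector; adapt it to the site you crawl
        for href in response.css("a::attr(href)").extract():
            yield {"link": href}

You can run it with scrapy runspider example_spider.py -o links.json.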


Hope that helps.

alecxe

To get you started with selenium and BeautifulSoup:

Install phantomjs with npm (Node Package Manager):

apt-get install nodejs
npm install phantomjs

Install selenium:

pip install selenium

and get the resulting page like this, then parse it with BeautifulSoup as usual:

from bs4 import BeautifulSoup as bs
from selenium import webdriver

client = webdriver.PhantomJS()
client.get("http://foo")  # placeholder URL
soup = bs(client.page_source, "html.parser")
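
From the soup you can then pull out the text, links and images the question mentions, for example:

text = soup.get_text()  # all visible text on the page
links = [a.get("href") for a in soup.find_all("a")]  # every hyperlink target
images = [img.get("src") for img in soup.find_all("img")]  # every image source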
Ehvince

A very fast way would be to iterate through all the tags and get their textContent. This is the JS snippet:

page =""; var all = document.getElementsByTagName("*"); for (tag of all) page = page + tag.textContent; 

or in selenium/python:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://ranprieur.com")
pagetext = driver.execute_script('var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page = page + tag.textContent; return page;')
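
pagetext is then an ordinary Python string you can process further:

print(pagetext)  # the concatenated textContent of every element
driver.quit()    # close the browser when done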


Eduard Florinescu