
I am working on a project in which I need to crawl several websites and gather different kinds of information from them: text, links, images, etc.

I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I get stuck when parsing sites that contain a lot of JavaScript, since most of the information on those pages is stored inside <script> tags.

Any ideas how to do this?

user1934948
  • and another resource: http://stackoverflow.com/questions/22624255/how-to-scrape-search-results-if-returned-in-javascript-using-python/22630026#22630026 – Ehvince Mar 31 '14 at 15:31
  • as a side note, selenium is much more lightweight than Ghost. – Ehvince Mar 31 '14 at 15:32

4 Answers


First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will parse everything for you just like a regular browser would.
The only difference is that its main interface is not a GUI/HMI but an API.

For example, you can use PhantomJS, or Chrome and Firefox, both of which support a headless mode.

For a more complete list of headless browsers check here.
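
For instance, here is a minimal sketch of driving headless Chrome through selenium (the URL is just a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("http://example.com")  # placeholder URL
html = driver.page_source  # rendered HTML, with the scripts already executed
driver.quit()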

bosnjak
  • I am able to get Ghost to work and load the page, but what should I do to get the whole webpage out of it? The documentation describes a function get_page, but it is not there, even in the code itself. – user1934948 Apr 23 '14 at 15:10

If a lot of dynamic JavaScript loading is involved in rendering the page, things get more complicated.

Basically, you have three ways to get the data from the website:

  • use the browser developer tools to see which AJAX requests go out on page load, then simulate these requests in your crawler. You will probably need the help of the json and requests modules (a sketch follows this list).
  • use tools that drive real browsers, like selenium. In this case you don't care how the page is loaded - you'll get what a real user sees. Note: you can use a headless browser too.
  • see if the website provides an API (e.g. the walmart API)
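
As a sketch of the first approach, here is how you might replay an AJAX request found in the network tab; the endpoint and parameters below are hypothetical and would come from your own inspection:

import requests

# hypothetical JSON endpoint discovered via the browser developer tools
response = requests.get(
    "http://example.com/api/items",
    params={"q": "laptops", "page": 1},
)
data = response.json()  # the server returns JSON, so no HTML parsing is needed
for item in data.get("items", []):
    print(item.get("title"))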

Also take a look at the Scrapy web-scraping framework - it doesn't handle AJAX calls either, but it is the best tool in the web-scraping world I've ever worked with.
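
To give an idea of what it looks like, here is a minimal sketch of a Scrapy spider; the start URL and CSS selector are placeholders for whatever your target site uses:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]  # placeholder start page

    def parse(self, response):
        # placeholder selector; adapt it to the site you crawl
        for href in response.css("a::attr(href)").extract():
            yield {"link": href}

You can run it with scrapy runspider example_spider.py -o links.json.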


Hope that helps.

alecxe

To get you started with selenium and BeautifulSoup:

Install phantomjs with npm (Node Package Manager):

apt-get install nodejs
npm install phantomjs

Install selenium:

pip install selenium

and get the resulting page like this, then parse it with BeautifulSoup as usual:

from bs4 import BeautifulSoup as bs
from selenium import webdriver

client = webdriver.PhantomJS()
client.get("http://foo")  # placeholder URL
soup = bs(client.page_source, "html.parser")
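
From the soup you can then pull out the text, links and images the question mentions, for example:

text = soup.get_text()  # all visible text on the page
links = [a.get("href") for a in soup.find_all("a")]  # every hyperlink target
images = [img.get("src") for img in soup.find_all("img")]  # every image source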
Ehvince

A very fast way would be to iterate through all the tags and get their textContent. This is the JS snippet:

page =""; var all = document.getElementsByTagName("*"); for (tag of all) page = page + tag.textContent; 

or in selenium/python:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://ranprieur.com")
pagetext = driver.execute_script('var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page = page + tag.textContent; return page;')
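
pagetext is then an ordinary Python string you can process further:

print(pagetext)  # the concatenated textContent of every element
driver.quit()    # close the browser when done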


Eduard Florinescu