275

I'm trying to develop a simple web scraper. I want to extract plain text without HTML markup. My code works on plain (static) HTML, but not when content is generated by JavaScript embedded in the page.

In particular, when I use urllib2.urlopen(request) to read the page content, it doesn't show anything that would be added by the JavaScript code, because that code isn't executed anywhere. Normally it would be run by the web browser, but that isn't a part of my program.

How can I access this dynamic content from within my Python code?


See also Can scrapy be used to scrape dynamic content from websites that are using AJAX? for answers specific to Scrapy.

Karl Knechtel
mocopera
  • Sounds like you might need something heavier; try Selenium or Watir. – wim Nov 08 '11 at 11:16
  • I've successfully done this in Java (I used the Cobra toolkit, http://lobobrowser.org/cobra.jsp). Since you want to hack in Python (always a good choice), I recommend these two options: http://www.packtpub.com/article/web-scraping-with-python-part-2 and http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/ – bpgergo Nov 08 '11 at 11:34
  • Please note that the [top-rated answer](https://stackoverflow.com/a/26440563/6243352) was last updated in 2017 and is out of date as of 2021, as PhantomJS and dryscrape have been deprecated. I recommend reading the entire thread before trying one of the techniques it recommends. – ggorlen Mar 30 '21 at 21:46

18 Answers

247

EDIT Sept 2021: phantomjs isn't maintained any more, either

EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library with PhantomJS as a web driver fast enough and easy to get the work done.

Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path:

phantomjs --version
# result:
2.1.1

Example: I created a sample page with the following HTML code (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

Without JavaScript it says: No javascript support, and with JavaScript: Yay! Supports javascript

Scraping without JS support:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>

Scraping with JS support:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id('intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'

You can also use Python library dryscrape to scrape javascript driven websites.

Scraping with JS support:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
OneCricketeer
avi
  • Sadly, no Windows support. – Expenzor Apr 17 '17 at 14:39
  • Any alternatives for those of us programming within Windows? – Hoshiko86 Jun 05 '17 at 19:54
  • @Expenzor I am working on Windows. PhantomJS works fine. – Aakash Choubey Jan 12 '18 at 10:43
  • Worth noting PhantomJS has been discontinued and is no longer under active development, in light of Chrome now supporting headless mode. Use of headless Chrome/Firefox is suggested. – sytech Mar 23 '18 at 20:42
  • @sytech Is it? I see regular commits: https://github.com/ariya/phantomjs/commits/master – avi Mar 29 '18 at 12:26
  • I get the following warning: `Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead`. Maybe @sytech was talking about Selenium support for it? – jpmc26 Apr 30 '18 at 04:37
  • It's both Selenium support and PhantomJS itself. https://github.com/ariya/phantomjs/issues/15344 – sytech Apr 30 '18 at 12:34
120

We are not getting the correct results because any JavaScript-generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial DOM, before it has been modified by JavaScript.

Therefore we need to render the JavaScript content before we crawl the page.

As Selenium has already been mentioned many times in this thread (along with how slow it sometimes gets), I will list two other possible solutions.


Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript-generated content, and we are going to follow just that.

What we will need:

  1. Docker installed on our machine. This is a plus over the other solutions so far, as it uses an OS-independent platform.

  2. Install Splash following the instructions listed for our corresponding OS.
    Quoting from the Splash documentation:

    Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.

    Essentially, we are going to use Splash to render JavaScript-generated content.

  3. Run the splash server: sudo docker run -p 8050:8050 scrapinghub/splash.

  4. Install the scrapy-splash plugin: pip install scrapy-splash

  5. Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py:

    Then go to your scrapy project’s settings.py and set these middlewares:

    DOWNLOADER_MIDDLEWARES = {
          'scrapy_splash.SplashCookiesMiddleware': 723,
          'scrapy_splash.SplashMiddleware': 725,
          'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    

    The URL of the Splash server (if you're using Windows or OSX, this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):

    SPLASH_URL = 'http://localhost:8050'
    

    And finally you need to set these values too:

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
  6. Finally, we can use a SplashRequest:

    In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS-generated data, you have to use SplashRequest (or SplashFormRequest) to render the page. Here's a simple example:

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "jsscraper"
        start_urls = ["http://quotes.toscrape.com/js/"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(
                    url=url, callback=self.parse, endpoint='render.html'
                )

        def parse(self, response):
            for q in response.css("div.quote"):
                quote = QuoteItem()  # QuoteItem: the item class defined in the project
                quote["author"] = q.css(".author::text").extract_first()
                quote["quote"] = q.css(".text::text").extract_first()
                yield quote


    SplashRequest renders the URL as HTML and returns the response, which you can use in the callback (parse) method.


Solution 2: Let's call this experimental at the moment (May 2018)...
This solution is for Python 3.6 only (at the moment).

Do you know the requests module (well, who doesn't)?
Now it has a web-crawling little sibling: requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

  1. Install requests-html: pipenv install requests-html

  2. Make a request to the page's url:

    from requests_html import HTMLSession
    
    session = HTMLSession()
    r = session.get(a_page_url)
    
  3. Render the response to get the Javascript generated bits:

    r.html.render()
    

Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.

John Moutafis
  • Can you expand on how to get the full HTML content, with JS bits loaded, after calling .render()? I'm stuck after that point. I'm not seeing all the iframes that are injected into the page normally from JavaScript in the `r.html.html` object. – fIwJlxSzApHEZIl Dec 13 '18 at 20:24
  • @anon58192932 Since at the moment this is an experimental solution and I don't know what exactly you are trying to achieve as a result, I cannot really suggest anything... You can create a new question here on SO if you haven't worked out a solution yet. – John Moutafis Jan 02 '19 at 13:57
  • I got this error: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead. – Joshua Stafford Apr 23 '19 at 15:59
  • @HuckIt this seems to be a known issue: https://github.com/psf/requests-html/issues/140 – John Moutafis Oct 15 '19 at 12:22
  • I have tried the first method, but I still cannot see the JS-rendered content. Can you please tell me what I am missing? – AlixaProDev Jul 12 '22 at 16:41
64

Maybe selenium can do it.

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)  # url is the page to scrape
time.sleep(5)  # crude wait to give the JavaScript time to run
htmlSource = driver.page_source
amazingthere
  • Selenium is really heavy for this kind of thing; it'd be unnecessarily slow and requires a browser head if you don't use PhantomJS, but this would work. – Joshua Hedges Jul 28 '17 at 16:27
  • @JoshuaHedges You can run other more standard browsers in headless mode. – reynoldsnlp Jan 09 '20 at 00:55
  • `options = webdriver.ChromeOptions() options.add_argument('--headless') driver = webdriver.Chrome(options=options)` – fantabolous Oct 15 '20 at 14:50
43

If you have ever used the Requests module for Python before: I recently found out that the developer created a new module called Requests-HTML, which now also has the ability to render JavaScript.

You can also visit https://html.python-requests.org/ to learn more about this module, or if you're only interested in rendering JavaScript then you can visit https://html.python-requests.org/?#javascript-support to directly learn how to use the module to render JavaScript using Python.

Essentially, once you correctly install the Requests-HTML module, the following example (shown on the above link) demonstrates how you can use this module to scrape a website and render the JavaScript it contains:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('http://python-requests.org/')

r.html.render()

r.html.search('Python 2 will retire in only {months} months!')['months']

'<time>25</time>'  # This is the result.
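
The `search` call does template-based matching on the rendered HTML. For readers curious what it returns, a rough regex equivalent on a canned fragment (the HTML string here is a stand-in for `r.html.html`, not fetched from the live site) looks like this:

```python
import re

# Canned fragment standing in for the rendered page content (hypothetical)
html = "<p>Python 2 will retire in only <time>25</time> months!</p>"

# Capture the text between the fixed parts of the template, similar to
# r.html.search('Python 2 will retire in only {months} months!')['months']
match = re.search(r"Python 2 will retire in only (.+?) months!", html)
print(match.group(1))  # <time>25</time>
```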

I recently learned about this from a YouTube video that demonstrates how the module works.

Josh Correia
SShah
  • Should note that this module has support for Python 3.6 only. – nat5142 Oct 12 '18 at 15:56
  • Seems to be using Chromium under the hood. Works great for me though. – Sid Apr 27 '20 at 11:46
  • Works for 3.9 too; that means it works with 3.6 and greater. – DDStackoverflow Nov 20 '21 at 01:08
  • Works fine on a Raspberry Pi. Just link to the native Chromium browser. https://stackoverflow.com/questions/66588194/requests-html-results-in-oserror-errno-8-exec-format-error-when-calling-html – Li_W Jan 22 '22 at 11:11
  • The domain `'http://python-requests.org/'` is down; it would be nice if you could update your answer to demonstrate what `.search` does exactly. – Shayan Jun 09 '22 at 14:51
18

It sounds like the data you're really looking for can be accessed via a secondary URL called by some JavaScript on the primary page.

While you could try running JavaScript on the server to handle this, a simpler approach might be to load up the page using Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
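
As a sketch of that approach: once the secondary URL is identified in Charles or Firebug, you can fetch its response and parse it directly. The endpoint and JSON payload below are hypothetical, and the payload is canned so the sketch is self-contained; a real script would fetch it with urllib or requests.

```python
import json
from urllib.parse import urlencode

# Hypothetical secondary endpoint discovered with Charles or Firebug
endpoint = "https://example.com/api/data?" + urlencode({"item": "12345"})

# In a real script:
#   import urllib.request
#   payload = urllib.request.urlopen(endpoint).read()
# Canned response here for illustration:
payload = '{"item": "12345", "price": 19.99, "in_stock": true}'

data = json.loads(payload)
print(data["price"])  # 19.99
```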

Josh Correia
Stephen Emslie
  • @Kris Just in case anyone stumbles on this and wants to try it instead of something as heavy as selenium, here's a short example. [This](https://www.mcmaster.com/#95462a029/=1e4ommx) will open the part detail page for a hex nut on the McMaster-Carr website. Their website content is mostly fetched using Javascript and has very little native page information. If you open your browser developer tools, navigate to the Network tab, and refresh the page, you can see all the requests made by the page and find the relevant data (in this case the part detail html). – SweepingsDemon Aug 13 '18 at 18:02
  • [This](https://www.mcmaster.com/mv1533652829/WebParts/Content/ItmPrsnttnWebPart.aspx?partnbrtxt=95462A029&cntnridtxt=MainContent) is a different url found in the Firefox devtool Network tab which, if followed, contains the html for most of the part information and exposes some of the parameters required to easily navigate to other part information for easier scraping. This particular example is not particularly useful as the price is generated by another Javascript function, but should serve well enough as an introduction to anyone wanting to follow Stephen's advice. – SweepingsDemon Aug 13 '18 at 18:10
17

This seems to be a good solution also, taken from a great blog post:

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important: convert the QString to ASCII for lxml to process.
# (Note: this snippet is Python 2 and uses the legacy PyQt4 QtWebKit API.)

# The following returns an lxml element tree
archive_links = html.fromstring(str(result.toAscii()))
print archive_links

# The following returns an array containing the URLs
raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
print raw_links
Robbie
marbel
14

Selenium is the best for scraping JS and Ajax content.

Check this article for extracting data from the web using Python

$ pip install selenium

Then download the Chrome webdriver.

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.python.org/")
nav = browser.find_element_by_id("mainnav")
print(nav.text)

Easy, right?

seco
Macnux
11

You can also execute JavaScript using the webdriver.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)
driver.execute_script('return document.title')

or store the value in a variable:

result = driver.execute_script('var text = document.title; return text')
ggorlen
Serpentr
10

I personally prefer using Scrapy and Selenium and dockerizing both in separate containers. That way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here's an example:

Use scrapy startproject to create your scraper and write your spider; the skeleton can be as simple as this:

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://somewhere.com']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0])


    def parse(self, response):

        # do stuff with results, scrape items etc.
        # now were just checking everything worked

        print(response.body)

The real magic happens in middlewares.py. Override two methods of the downloader middleware, __init__ and process_request, in the following way:

# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()

        # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                       desired_capabilities=chrome_options.to_capabilities())

    def process_request(self, request, spider):
        self.driver.get(request.url)

        # sleep a bit so the page has time to load,
        # or monitor items on the page to continue as soon as it is ready
        sleep(4)

        # if you need to manipulate the page content, like clicking and scrolling, do it here
        # self.driver.find_element_by_css_selector('.my-class').click()

        # you only need the now fully rendered html from the page to get results
        body = deepcopy(self.driver.page_source)

        # copy the current url in case of redirects
        url = deepcopy(self.driver.current_url)

        return HtmlResponse(url, body=body, encoding='utf-8', request=request)

Don't forget to enable this middleware by uncommenting the next lines in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}

Next, dockerization. Create your Dockerfile from a lightweight image (I'm using Python Alpine here), copy your project directory to it, and install the requirements:

# Use an official Python runtime as a parent image
FROM python:3.6-alpine

# install some packages necessary to scrapy, and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev

WORKDIR /my_scraper

ADD requirements.txt /my_scraper/

RUN pip install -r requirements.txt

ADD . /my_scraper

And finally bring it all together in docker-compose.yaml:

version: '2'
services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
    # use this command to keep the container running
    command: tail -f /dev/null

Run docker-compose up -d. If you're doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and to build your scraper image as well.

Once it's done, you can check that your containers are running with docker ps and also check that the name of the selenium container matches that of the environment variable that we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).

Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (for me the command was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory, and run your scraper with scrapy crawl my_spider.

The entire thing is on my github page and you can get it from here

tarikki
9

A mix of BeautifulSoup and Selenium works very well for me.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # waits up to 10 seconds until the element is located; other wait conditions
    # such as visibility_of_element_located or text_to_be_present_in_element also exist
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement")))

    html = driver.page_source
    soup = bs(html, "lxml")
    dynamic_text = soup.find_all("p", {"class": "class_name"})  # or other attributes, optional
except TimeoutException:
    print("Couldn't locate element")

P.S. You can find more wait conditions here

Biarys
  • What's BeautifulSoup for? Selenium already has selectors and works on the live page. – ggorlen Aug 27 '21 at 23:27
  • @ggorlen to extract the text or other data. Selenium selectors are there to navigate elements on the page. This was the case when I used it. – Biarys Aug 29 '21 at 00:39
  • Selenium can extract data too after the element has been selected. See many answers on this page, such as [this](https://stackoverflow.com/a/48328276/6243352). – ggorlen Aug 29 '21 at 01:14
8

Using PyQt5

from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
import sys
import bs4 as bs
import urllib.request


class Client(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ""
        self.loadFinished.connect(self.on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print("Load Finished")

    def Callable(self,data):
        self.html = data
        self.app.quit()

# url = ""
# client_response = Client(url)
# print(client_response.html)
Ash Ishh
  • +1, Thanks! This was the solution that worked for me, since Selenium is a bit overkill for such a simple task and requests-html is only for Python 3.6. I would recommend this solution over any other. – WhiteWood Jun 01 '21 at 17:17
  • The above code worked for me, but only after installing *QtWebEngineWidgets* separately. Install in this order: *pip install PyQt5* and afterwards *pip install QtWebEngineWidgets*. – NeuroMorphing Jul 20 '22 at 08:55
  • Is it possible to execute JS on a website with this? – MaxFrost Oct 01 '22 at 08:22
  • Yes, see https://stackoverflow.com/a/52100343; the runJavaScript function should work post page load. – Ash Ishh Oct 02 '22 at 17:26
4

You'll want to use urllib, requests, BeautifulSoup and the Selenium web driver in your script for different parts of the page (to name a few).
Sometimes you'll get what you need with just one of these modules.
Sometimes you'll need two, three, or all of them.
Sometimes you'll need to switch off the JS in your browser.
Sometimes you'll need header info in your script.
No two websites can be scraped the same way, and no website can be scraped the same way forever without having to modify your crawler, usually after a few months. But they can all be scraped! Where there's a will there's a way for sure.
If you need scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle.
Just keep searching for how to try what with these modules, and keep pasting your errors into Google.
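
For the pickle suggestion above, a minimal sketch (the file name and record layout are made up for illustration):

```python
import pickle
from pathlib import Path

# Hypothetical records produced by a scrape run
records = [{"url": "https://example.com", "title": "Example Domain"}]

# Save for future runs...
Path("scraped.dat").write_bytes(pickle.dumps(records))

# ...and load them back later
restored = pickle.loads(Path("scraped.dat").read_bytes())
print(restored[0]["title"])  # Example Domain
```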

4

As of late 2022, Pyppeteer is no longer maintained; consider playwright-python as an alternative.


Pyppeteer

You might consider Pyppeteer, a Python port of the Chrome/Chromium driver front-end Puppeteer.

Here's a simple example to show how you can use Pyppeteer to access data that was injected into the page dynamically:

import asyncio
from pyppeteer import launch


async def main():
    browser = await launch({"headless": True})
    [page] = await browser.pages()

    # normally, you go to a live site...
    #await page.goto("http://www.example.com")
    # but for this example, just set the HTML directly:
    await page.setContent("""
    <body>
    <script>
    // inject content dynamically with JS, not part of the static HTML!
    document.body.innerHTML = `<p>hello world</p>`; 
    </script>
    </body>
    """)
    print(await page.content()) # shows that the `<p>` was inserted

    # evaluate a JS expression in browser context and scrape the data
    expr = "document.querySelector('p').textContent"
    print(await page.evaluate(expr, force_expr=True)) # => hello world

    await browser.close()


asyncio.run(main())

See Pyppeteer's reference docs.

ggorlen
3

Try accessing the API directly

A common scenario you'll see in scraping is that the data is being requested asynchronously from an API endpoint by the webpage. A minimal example of this would be the following site:

<body>
<script>
fetch("https://jsonplaceholder.typicode.com/posts/1")
  .then(res => {
    if (!res.ok) throw Error(res.status);
    
    return res.json();
  })
  .then(data => {
    // inject data dynamically via JS after page load
    document.body.innerText = data.title;
  })
  .catch(err => console.error(err))
;
</script>
</body>

In many cases, the API will be protected by CORS or an access token or prohibitively rate limited, but in other cases it's publicly-accessible and you can bypass the website entirely. For CORS issues, you might try cors-anywhere.

The general procedure is to use your browser's developer tools' network tab to search the requests made by the page for keywords/substrings of the data you want to scrape. Often, you'll see an unprotected API request endpoint with a JSON payload that you can access directly with urllib or requests modules. That's the case with the above runnable snippet which you can use to practice. After clicking "run snippet", here's how I found the endpoint in my network tab:

example network tab showing remote URL endpoint found with a search

This example is contrived; the endpoint URL will likely be non-obvious from looking at the static markup, because it could be dynamically assembled, minified, and buried under dozens of other requests and endpoints. The network request will also show any relevant request payload details, like an access token, that you may need.

After obtaining the endpoint URL and relevant details, build a request in Python using a standard HTTP library and request the data:

>>> import requests
>>> res = requests.get("https://jsonplaceholder.typicode.com/posts/1")
>>> data = res.json()
>>> data["title"]
'sunt aut facere repellat provident occaecati excepturi optio reprehenderit'

When you can get away with it, this tends to be much easier, faster and more reliable than scraping the page with Selenium, Playwright-Python, Scrapy or whatever the popular scraping libraries are at the time you're reading this post.

If you're unlucky and the data hasn't arrived via an API request that returns the data in a nice format, it could be part of the page's original payload in a <script> tag, either as a JSON string or (more likely) a JS object. For example:

<body>
<script>
  var someHardcodedData = {
    userId: 1,
    id: 1,
    title: 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 
    body: 'quia et suscipit\nsuscipit recusandae con sequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'
  };
  document.body.textContent = someHardcodedData.title;
</script>
</body>

There's no one-size-fits-all way to obtain this data. The basic technique is to use BeautifulSoup to access the <script> tag text, then apply a regex or a parser to extract the object structure, JSON string, or whatever format the data might be in. Here's a proof-of-concept on the sample structure shown above:

import json
import re
from bs4 import BeautifulSoup

# pretend we've already used requests to retrieve the data, 
# so we hardcode it for the purposes of this example
text = """
<body>
<script>
  var someHardcodedData = {
    userId: 1,
    id: 1,
    title: 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 
    body: 'quia et suscipit\nsuscipit recusandae con sequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'
  };
  document.body.textContent = someHardcodedData.title;
</script>
</body>
"""
soup = BeautifulSoup(text, "lxml")
script_text = str(soup.select_one("script"))
pattern = r"title: '(.*?)'"
print(re.search(pattern, script_text, re.S).group(1))

Check out these resources for parsing JS objects that aren't quite valid JSON:
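
For the very simplest cases, like the object literal above, a couple of regex substitutions can coerce the JS into valid JSON. This is a crude sketch that assumes bare keys and single-quoted strings containing no embedded quotes or colons; anything messier needs a proper parser:

```python
import json
import re

# A simple JS object literal (no quotes or colons inside the string values)
js_obj = "{userId: 1, title: 'hello world'}"

as_json = re.sub(r"(\w+)\s*:", r'"\1":', js_obj)  # quote the bare keys
as_json = as_json.replace("'", '"')               # single -> double quotes
data = json.loads(as_json)
print(data["title"])  # hello world
```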

Here are some additional case studies/proofs-of-concept where scraping was bypassed using an API:

If all else fails, try one of the many dynamic scraping libraries listed in this thread.

ggorlen
  • modern pages have an unmanageable number of asynchronous requests. This works only on smaller pages when you have an idea of what to look for. – anishtain4 Dec 18 '22 at 20:33
  • @anishtain4 the number of requests hardly matters if you use the search tool in dev tools to filter them for the particular piece of data you're looking for, as shown in this post. I've successfully used this technique on dozens of modern webpages, some of which are shown in case study links. Give it a try--it's a hugely overlooked technique that saves writing a ton of scraping code, when the API is otherwise unprotected. Even if you are using a dynamic scraper, often you want to bypass the often unstable DOM and work with requests/responses since you have the credentials and correct origin. – ggorlen Dec 18 '22 at 20:34
  • It was an interesting technique, I'll keep that in mind. Unfortunately, the site that I'm trying to scrape keeps bouncing me out. – anishtain4 Dec 19 '22 at 02:19
  • Yeah, it's not intended as a general-purpose solution, just an option that is nice when it works and is easy enough to check while you're scoping out how to get the data you want. The JS on the page is generally pulling data from a ` – ggorlen Dec 19 '22 at 02:21
2

Playwright-Python

Yet another option is playwright-python, a port of Microsoft's Playwright (itself a Puppeteer-influenced browser automation library) to Python.

Here's the minimal example of selecting an element and grabbing its text:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://whatsmyuseragent.org/")
    ua = page.query_selector(".user-agent")
    print(ua.text_content())
    browser.close()
ggorlen
1

As mentioned, Selenium is a good choice for rendering the results of the JavaScript:

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)

url = "https://www.example.com"
browser.get(url)

And gazpacho is a really easy library to parse over the rendered html:

from gazpacho import Soup

soup = Soup(browser.page_source)
soup.find("a").attrs['href']
emehex
1

I recently used the requests_html library to solve this problem.

Their expanded documentation at readthedocs.io is pretty good (skip the annotated version at pypi.org). If your use case is basic, you are likely to have some success.

from requests_html import HTMLSession
session = HTMLSession()
response = session.request(method="get", url="https://www.google.com/")
response.html.render()

If you are having trouble rendering the data you need with response.html.render(), you can pass some javascript to the render function to render the particular js object you need. This is copied from their docs, but it might be just what you need:

If script is specified, it will execute the provided JavaScript at runtime. Example:

script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    } 
"""

Returns the return value of the executed script, if any is provided:

>>> response.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

In my case, the data I wanted were the arrays that populated a JavaScript plot, but the data wasn't getting rendered as text anywhere in the HTML. Sometimes it's not clear at all what the object names are for the data you want if the data is populated dynamically. If you can't track down the JS objects directly from View Source or Inspect, you can type "window" followed by ENTER in the debugger console of the browser (Chrome) to pull up a full list of objects rendered by the browser. If you make a few educated guesses about where the data is stored, you might have some luck finding it there. My graph data was under window.view.data in the console, so in the "script" variable passed to the .render() method quoted above, I used:

return {
    data: window.view.data
}
  • It seems `requests_html` is no longer actively maintained (last update May 2020). It uses `pyppeteer` for rendering, which does seem to be actively maintained; it uses Chromium for rendering underneath. – VirtualScooter Jul 05 '21 at 17:30
-4

Easy and quick solution:

I was dealing with the same problem. I wanted to scrape some data that is built with JavaScript. If I scraped only the text from this site with BeautifulSoup, I ended up with tags in the text. I wanted to render those tags and grab the information from them. Also, I didn't want to use heavy frameworks like Scrapy and Selenium.

I found that the get method of the requests module takes URLs, and it actually retrieves the script tags.

Example:

import requests
custom_User_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
url = "https://www.abc.xyz/your/url"
response = requests.get(url, headers={"User-Agent": custom_User_agent})
html_text = response.text

This will load the site and retrieve the tags.

Hope this helps as a quick and easy solution for a site that is loaded with script tags.

HITESH GUPTA