Data scraping from a webpage with JavaScript using Python

Question

I'm trying to scrape the title from a webpage. Initially I tried BeautifulSoup, but found that the page won't load without JavaScript. So I'm using some code I found through Google that uses the requests-html library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

soup.find_all('h1')

But there's always an error along the lines of:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.

score 1 · Answer 1 · answered Jun 24 '19 at 23:47

As Ivan said, passing sleep=1 and keep_page=True to render() does the trick. Here is the full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

Response:

[<title>
    Milled wheat and wheat flour produced</title>]

NBlack


Hmm, I wish this was what I was getting, but I still seem to get the same error. – facsasd Jun 24 '19 at 23:55
Did you try my code? I ran it in my console (Python 3.7) and it works like a charm. Please paste your current code so we can fix it :) – NBlack Jun 25 '19 at 15:30
So... I did try your code. Sometimes it works and sometimes it doesn't, and I honestly don't know why anymore. – facsasd Jun 25 '19 at 16:15
I'll try to replicate it – facsasd Jun 25 '19 at 17:47
I ran it 10 times in a row and it works. Try sleep=2 (2 seconds), or up to 5 seconds if your internet is slow. From the docs, sleep is an integer that, if provided, sets how many seconds to sleep after the initial render. – NBlack Jun 26 '19 at 08:56
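The advice in these comments (raise sleep when the render fails intermittently) could be sketched as a small retry helper. This is only a sketch under assumptions: render_with_retry and its parameters are hypothetical names, not part of requests-html, and the sleep values are arbitrary. It accepts any callable that takes the same sleep and keep_page keyword arguments as resp.html.render.

```python
def render_with_retry(render, sleeps=(1, 2, 5), retry_on=(Exception,)):
    """Call render(sleep=..., keep_page=True), retrying with longer sleeps.

    `render` is any callable accepting `sleep` and `keep_page` keyword
    arguments, for example `resp.html.render` from requests-html.
    `sleeps` and `retry_on` are illustrative defaults, not library settings.
    """
    if not sleeps:
        raise ValueError("sleeps must contain at least one value")
    for attempt, sleep in enumerate(sleeps, start=1):
        try:
            return render(sleep=sleep, keep_page=True)
        except retry_on:
            # Give up only after the longest sleep has also failed.
            if attempt == len(sleeps):
                raise


# Possible usage with the code from this answer (needs network access,
# so it is left commented out here):
#
#   import pyppeteer.errors
#   from requests_html import HTMLSession
#
#   resp = HTMLSession().get(url)
#   render_with_retry(resp.html.render,
#                     retry_on=(pyppeteer.errors.NetworkError,))
```

Restricting retry_on to pyppeteer.errors.NetworkError, the exception in the question's traceback, keeps unrelated errors from being retried silently.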

score 0 · Answer 2 · answered Jun 24 '19 at 23:39

This looks like a bug in the underlying library pyppeteer (the Python port of Puppeteer that appears in your traceback), triggered while processing some of the page's JavaScript. Here's a workaround from https://github.com/kennethreitz/requests-html/issues/251 that may help:

resp.html.render(sleep=1, keep_page=True)

Ivan Sveshnikov


I tried it out; I still seem to be getting a similar error. – facsasd Jun 24 '19 at 23:57
You might try increasing the `sleep` parameter. If the page is heavy and the machine is slow, it can help. – Ivan Sveshnikov Jun 25 '19 at 19:12
Note for my future self, or other people: I tried specifying only `keep_page=True`, and it was enough to do the trick. – Nuclear03020704 Jan 06 '22 at 13:03

score 0 · Answer 3 · answered Jun 24 '19 at 23:45

You need to load the JS, because without it the HTML won't load. You can use Selenium.

Andrés Aviña


Hmm, I'm trying to follow along with this tutorial http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/ and I'm not sure how it works there. – facsasd Jun 25 '19 at 00:00
The problem is specifically with the page you want to scrape, because it has security against scrapers. – Andrés Aviña Jun 25 '19 at 21:17

score 0 · Answer 4 · answered Jun 24 '19 at 23:45

Try Selenium.

Selenium is a library that allows programs to interact with web pages by taking control of the browser.

Here is an example in an answer to someone else's question.
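The linked example is not reproduced here, so this is a separate minimal sketch of the Selenium approach, not that answer's code. It assumes Selenium and a Chrome install are available (older Selenium versions also need chromedriver on the PATH). wait_for_elements and scrape_headings are hypothetical helper names, and the timeout and polling values are arbitrary. The idea is to open the page in a real browser, wait until the JavaScript has produced at least one h1, then read the title and headings.

```python
import time


def wait_for_elements(driver, tag, timeout=10.0, poll=0.5):
    """Poll the page until at least one <tag> element exists, then return them."""
    deadline = time.monotonic() + timeout
    while True:
        # "tag name" is the locator string behind selenium's By.TAG_NAME.
        elements = driver.find_elements("tag name", tag)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no <{tag}> element appeared within {timeout}s")
        time.sleep(poll)


def scrape_headings(url, timeout=10.0):
    """Open url in Chrome and return (page title, list of <h1> texts)."""
    # Imported lazily so the helper above works without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        headings = wait_for_elements(driver, "h1", timeout)
        return driver.title, [h.text for h in headings]
    finally:
        driver.quit()  # always close the browser, even on errors


# Possible usage (opens a real browser window, so left commented out):
#
#   url = "https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601"
#   title, headings = scrape_headings(url)
#   print(title, headings)
```

Selenium also ships its own WebDriverWait with expected conditions for this kind of waiting; the hand-written polling loop is used here only to keep the sketch short and easy to follow.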

lowtex


Hmm, I'm trying to follow along with this tutorial http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/ and I'm not sure how it works there. – facsasd Jun 25 '19 at 00:00
