
I want to perform the following actions automatically:

  • Open a web page in Google Chrome.
  • Wait for it to render all the needed information.
  • Open Inspect Element, go to the Network tab, and look at the XHR requests.
  • Find the file that I need.
  • Copy the content of its response (to save it in a .txt file).

It's a kind of web scraping, but with less effort (or so I think).

The problem is that I can't find any tools that allow me to do that.

I started with Python and Selenium (ChromeDriver), but I couldn't find any information on whether it is possible to get XHR responses or not. All the tutorials are about scraping HTML. It seems like it should be possible, but my research didn't help.

Any idea?

Thank you.

– TheVic
  • maybe sharing the url will be helpful – αԋɱҽԃ αмєяιcαη Dec 04 '19 at 20:51
  • BrowserMob proxy will help you with this. https://stackoverflow.com/questions/17082425/running-selenium-webdriver-with-a-proxy-in-python – DMart Dec 04 '19 at 20:57
  • Using just Selenium, you can't get the XHR responses from network. But there are extensions that may help with this -- @DMart mentioned BrowserMob, which I have heard is quite popular. – CEH Dec 04 '19 at 23:16
  • You can monkey patch xhr to keep track of those, the tricky part is doing that before the xhr is sent. Puppeteer has built-in support for this so I would switch. – pguardiario Dec 05 '19 at 01:09
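
For reference, the monkey-patching idea from the last comment could look roughly like this with Selenium: inject a script that wraps XMLHttpRequest before any of the page's own scripts run (here via Chrome DevTools' Page.addScriptToEvaluateOnNewDocument), then read the captured responses back out. This is a sketch, not a drop-in solution; the window._xhrLog name is made up for illustration.

import time

from selenium import webdriver

# JavaScript patch: record the URL and body of every completed XHR
PATCH = """
window._xhrLog = [];
const origOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method, url) {
    this.addEventListener('load', function () {
        try {
            window._xhrLog.push({url: url, body: this.responseText});
        } catch (e) { /* non-text responseType */ }
    });
    return origOpen.apply(this, arguments);
};
"""

driver = webdriver.Chrome()
# Install the patch before the page's scripts run, so no XHR slips through
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": PATCH})
driver.get("https://academic.microsoft.com/home")
time.sleep(5)  # crude wait; an explicit wait on a page element is better

for entry in driver.execute_script("return window._xhrLog"):
    print(entry["url"])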

2 Answers


The content of the website you are trying to scrape is dynamically generated by JavaScript.

You have two options to work your way around that:

  1. Simulate human browser interaction using Selenium: open the website, wait until all the content is rendered, and then use Selenium to extract the data you seek. This approach deals with the Elements tab; you just use CSS or XPath selectors to get the tags you want.

  2. Instead of finding a way to make Selenium go to the Network tab and save the content (which you will find extremely hard to do), get the URL of the XHR request, rebuild the same request with the same headers and parameters (if any exist), and then use requests to send it; you can save the content easily.

Let's try to scrape the home page of Microsoft Academic (https://academic.microsoft.com/home).

First approach:

from selenium import webdriver

driver = webdriver.Chrome() # Launch the browser 
driver.get("https://academic.microsoft.com/home") # Go to the given url
authors = driver.find_elements_by_xpath('//a[@data-appinsights-action="TopAuthorSelected"]') # get the elements using selectors
for author in authors: # loop through them 
    print(author.text)

Output:

1. Yoshua Bengio
2. Geoffrey E. Hinton
3. Andrew Zisserman
4. Ilya Sutskever
5. Jian Sun
6. Trevor Darrell
7. Scott Shenker
8. Jiawei Han
9. Kaiming He
10. Ross Girshick
11. Ion Stoica
12. Hari Balakrishnan
13. R Core Team
14. Jitendra Malik
15. Jeffrey Dean

Second approach:

import requests 
res = requests.get('https://academic.microsoft.com/api/analytics/authors/topauthors?topicId=41008148&take=15&filter=1&dateRange=1').json()
# The XHR response is usually in JSON format
#res = [{'name': 'Yoshua Bengio', 'id': '161269817', 'lat': 0.0, 'lon': 0.0}, {'name': 'Geoffrey E. Hinton', 'id': '563069026', 'lat': 0.0, 'lon': 0.0}, {'name': 'Andrew Zisserman', 'id': '2469405535', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ilya Sutskever', 'id': '215131072', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jian Sun', 'id': '2200192130', 'lat': 0.0, 'lon': 0.0}, {'name': 'Trevor Darrell', 'id': '2174985400', 'lat': 0.0, 'lon': 0.0}, {'name': 'Scott Shenker', 'id': '719828399', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jiawei Han', 'id': '2121939561', 'lat': 0.0, 'lon': 0.0}, {'name': 'Kaiming He', 'id': '2164292938', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ross Girshick', 'id': '2473549963', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ion Stoica', 'id': '2161479384', 'lat': 0.0, 'lon': 0.0}, {'name': 'Hari Balakrishnan', 'id': '1998464616', 'lat': 0.0, 'lon': 0.0}, {'name': 'R Core Team', 'id': '2976715238', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jitendra Malik', 'id': '2136556746', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jeffrey Dean', 'id': '2429370538', 'lat': 0.0, 'lon': 0.0}]
for author in res:
    print(author['name'])

Output:

Yoshua Bengio
Geoffrey E. Hinton
Andrew Zisserman
Ilya Sutskever
Jian Sun
Trevor Darrell
Scott Shenker
Jiawei Han
Kaiming He
Ross Girshick
Ion Stoica
Hari Balakrishnan
R Core Team
Jitendra Malik
Jeffrey Dean

The second approach saves time and resources, and it is more straightforward.
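
Since the goal in the question is to save the response content to a .txt file, the second approach makes that trivial (the file name here is arbitrary):

import json

import requests

url = ('https://academic.microsoft.com/api/analytics/authors/topauthors'
       '?topicId=41008148&take=15&filter=1&dateRange=1')
data = requests.get(url).json()

# Write the JSON response to a text file
with open('topauthors.txt', 'w') as f:
    json.dump(data, f, indent=2)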

[Image: using the first approach]

[Image: using the second approach]

– Ahmed Soliman

BrowserMob Proxy (https://github.com/lightbody/browsermob-proxy) will help you with this. It will capture all requests and, when configured to do so, their responses.

See this previous answer for more details: Running Selenium Webdriver with a proxy in Python (https://stackoverflow.com/questions/17082425/running-selenium-webdriver-with-a-proxy-in-python)
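
For anyone looking for a concrete starting point, a minimal sketch with the browsermob-proxy Python package might look like the following. The path to the BrowserMob Proxy binary is a placeholder (the proxy is a separate download), and capturing HTTPS traffic requires trusting the proxy's certificate.

from browsermobproxy import Server
from selenium import webdriver

server = Server("/path/to/browsermob-proxy/bin/browsermob-proxy")  # placeholder path
server.start()
proxy = server.create_proxy()

# Route the browser's traffic through the capturing proxy
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server={}".format(proxy.proxy))
driver = webdriver.Chrome(options=options)

# Start recording a HAR; captureContent asks the proxy to keep response bodies
proxy.new_har("capture", options={"captureContent": True})
driver.get("https://academic.microsoft.com/home")

# Every request/response pair ends up as an entry in the HAR
for entry in proxy.har["log"]["entries"]:
    if "topauthors" in entry["request"]["url"]:  # pick out the XHR you care about
        print(entry["response"]["content"].get("text", ""))

driver.quit()
server.stop()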

– DMart