The website you are trying to scrape generates its content dynamically with JavaScript.
You have two options to work around that:
1. Simulate a human browsing session with selenium: open the website, wait until all the content is rendered, and then use selenium to extract the data you want. This approach works with what you see in the Elements tab of the browser's developer tools. You just use CSS or XPath selectors to get the tags you want.
2. Instead of trying to make selenium read the Network tab and save the content (which you will find extremely hard to do), get the URL of the XHR request, build the same request with the same headers and parameters (if there are any), and send it with requests. You can then save the content easily.
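As a rough sketch of the second option, the snippet below splits an XHR URL copied from the Network tab into its endpoint and query parameters, so they can be passed to requests separately and adjusted. The URL is the one used in the second approach of this answer. The helper names (split_xhr_url, fetch), the header values, and the idea that changing take returns fewer authors are assumptions for illustration, not something the site documents.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

# XHR URL copied from the browser's Network tab
XHR_URL = ("https://academic.microsoft.com/api/analytics/authors/topauthors"
           "?topicId=41008148&take=15&filter=1&dateRange=1")


def split_xhr_url(url):
    """Split a copied XHR URL into (endpoint, params dict)."""
    parts = urlsplit(url)
    endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    params = dict(parse_qsl(parts.query))
    return endpoint, params


endpoint, params = split_xhr_url(XHR_URL)

# Hypothetical tweak: ask for 5 authors instead of 15.
# Whether the API honors this value is an assumption.
params["take"] = "5"

# Placeholder headers; copy the real ones from the request in the Network tab.
headers = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0",
}


def fetch(endpoint, params, headers):
    """Send the rebuilt request and return the decoded JSON body."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(endpoint, params=params, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on an HTTP error status
    return response.json()
```

Calling fetch(endpoint, params, headers) would then send the request. Keeping the parameters in a dictionary makes it easy to loop over different values without editing the URL string by hand.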
Let's try to scrape the Microsoft Academic home page (https://academic.microsoft.com/home).
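First approach:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # Launch the browser
driver.get("https://academic.microsoft.com/home")  # Go to the given URL
xpath = '//a[@data-appinsights-action="TopAuthorSelected"]'
# Wait up to 10 seconds for the JavaScript-rendered links, then collect them
authors = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, xpath)))
for author in authors:  # Loop through them
    print(author.text)
driver.quit()  # Close the browser
Output: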
1. Yoshua Bengio
2. Geoffrey E. Hinton
3. Andrew Zisserman
4. Ilya Sutskever
5. Jian Sun
6. Trevor Darrell
7. Scott Shenker
8. Jiawei Han
9. Kaiming He
10. Ross Girshick
11. Ion Stoica
12. Hari Balakrishnan
13. R Core Team
14. Jitendra Malik
15. Jeffrey Dean
Second approach:
import requests
res = requests.get('https://academic.microsoft.com/api/analytics/authors/topauthors?topicId=41008148&take=15&filter=1&dateRange=1', timeout=10).json()
# The XHR response is usually in JSON format
#res = [{'name': 'Yoshua Bengio', 'id': '161269817', 'lat': 0.0, 'lon': 0.0}, {'name': 'Geoffrey E. Hinton', 'id': '563069026', 'lat': 0.0, 'lon': 0.0}, {'name': 'Andrew Zisserman', 'id': '2469405535', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ilya Sutskever', 'id': '215131072', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jian Sun', 'id': '2200192130', 'lat': 0.0, 'lon': 0.0}, {'name': 'Trevor Darrell', 'id': '2174985400', 'lat': 0.0, 'lon': 0.0}, {'name': 'Scott Shenker', 'id': '719828399', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jiawei Han', 'id': '2121939561', 'lat': 0.0, 'lon': 0.0}, {'name': 'Kaiming He', 'id': '2164292938', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ross Girshick', 'id': '2473549963', 'lat': 0.0, 'lon': 0.0}, {'name': 'Ion Stoica', 'id': '2161479384', 'lat': 0.0, 'lon': 0.0}, {'name': 'Hari Balakrishnan', 'id': '1998464616', 'lat': 0.0, 'lon': 0.0}, {'name': 'R Core Team', 'id': '2976715238', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jitendra Malik', 'id': '2136556746', 'lat': 0.0, 'lon': 0.0}, {'name': 'Jeffrey Dean', 'id': '2429370538', 'lat': 0.0, 'lon': 0.0}]
for author in res:
    print(author['name'])
Output:
Yoshua Bengio
Geoffrey E. Hinton
Andrew Zisserman
Ilya Sutskever
Jian Sun
Trevor Darrell
Scott Shenker
Jiawei Han
Kaiming He
Ross Girshick
Ion Stoica
Hari Balakrishnan
R Core Team
Jitendra Malik
Jeffrey Dean
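Since the response is already a list of dictionaries, saving it takes only a few lines. The sketch below uses the first two records from the response shown above as sample data. The file names and the helper names save_json and save_csv are hypothetical choices for illustration.

```python
import csv
import json
from pathlib import Path

# Sample data: the first two records of the response shown above
authors = [
    {'name': 'Yoshua Bengio', 'id': '161269817', 'lat': 0.0, 'lon': 0.0},
    {'name': 'Geoffrey E. Hinton', 'id': '563069026', 'lat': 0.0, 'lon': 0.0},
]


def save_json(records, path):
    """Write the records to a JSON file, one indented object per author."""
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def save_csv(records, path):
    """Write the records to a CSV file, using the first record's keys as the header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)


save_json(authors, "top_authors.json")  # hypothetical file name
save_csv(authors, "top_authors.csv")    # hypothetical file name
```

In a real run you would pass res from the second approach instead of the sample list.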
The second approach saves time and resources, and it is more straightforward.
[Image: using the first approach]
[Image: using the second approach]