python javascript scrape automatically

Question

Python novice here.

I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried

pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")

and

requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")

and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the html source code these methods retrieve.

If I go to my browser's developer tools and copy the "final" html as shown there in the "Elements" tab, all of the information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?

Alternatively, if there's no direct way to obtain the info from the html, there might be a second possibility. The site allows downloading the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.

Please excuse me if this question is poorly worded, and thank you very much in advance.

Tusen takk (a thousand thanks)!

Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it work so that it first runs all the javascript and the page actually contains the data.

Have you looked at [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)? — pmkro, Feb 23 '18 at 21:30
You could try using http://selenium-python.readthedocs.io/. This product makes it possible to load pages in a browser and then to locate buttons and press them. — Bill Bell, Feb 23 '18 at 21:38
Thank you very much for the link, @BillBell, I did not know about Selenium. Seems like this is getting a bit more complicated than I had hoped, but well...maybe that speeds up the learning process :D — I_love_Norway, Feb 23 '18 at 21:48
You're welcome. If you get stuck there are lots of people here that can help. — Bill Bell, Feb 23 '18 at 21:55

score 0 · Accepted Answer · answered Feb 23 '18 at 21:51

I think you will want to either use a library to render the page or find the API calls the page makes (more on that further down). For rendering, this answer seems to apply to Python. I will also copy the code from that answer for completeness.

You can pip install selenium from a command line, and then run something like:

from selenium import webdriver
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'

# Download the raw HTML (returned as bytes) and save it to disk
with urlopen(url) as conn:
    data = conn.read()

with open(file_name, 'wb') as f:
    f.write(data)

# Open the saved file in a real browser and read back the rendered source
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()

I think you could probably skip the file write and just pass the URL straight to that browser.get call, but I'll leave that to you to find out.
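For what it's worth, here is a rough sketch of what loading the page directly might look like. The URL pattern is the one from the question. The helper names (`company_page_url`, `fetch_rendered_html`) are made up for illustration, and waiting for a `<table>` element is purely an assumption about how the page renders its data, so that locator would likely need adjusting after inspecting the real page.

```python
# URL pattern taken from the question; survey and company ids fill the placeholders.
PAGE_URL = "https://www.transparantiebenchmark.nl/en/scores-0#/survey/{survey}/company/{company}"


def company_page_url(survey, company):
    """Build the page URL for one company (hypothetical helper)."""
    return PAGE_URL.format(survey=survey, company=company)


def fetch_rendered_html(url, timeout=20):
    """Open url in Firefox, wait for the JavaScript to add content, return the HTML."""
    # Imported here so the URL helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # ASSUMPTION: the data ends up inside a <table>; change the locator if not.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return browser.page_source
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()
```

Something like `fetch_rendered_html(company_page_url(4, 793))` should then give back the HTML after the scripts have run, which could be handed to pd.read_html or BeautifulSoup. Looping over several company ids is just a matter of calling it once per id.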

The other thing you can do is look for the AJAX calls in your browser's developer tools. E.g. in Chrome, open the three-dot menu -> More tools -> Developer tools, or press something like F12. Then look at the Network tab. There will be various requests. Click one, open the Preview tab, and go through the requests until you find a response that looks like JSON data. You are effectively looking for the API calls the page uses to get the data it is generated from. Once you find one, click the Headers tab and you will see a Request URL.

E.g. this https://sa-tb.nl/api/widget/chart/survey/4/sector/38 has lots of data

The problem here is that it may or may not be repeatable (the API may change, IDs may change). You may have a similar problem with plain HTML scraping, as the HTML could change just as easily.
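As a hedged sketch of how calling such an API from Python could look, using only the standard library: the endpoint path below is the per-company URL that comes up in the comments on this answer, and the function names are invented for illustration. The `Referer` header is an assumption and may well be unnecessary. The shape of the returned JSON is not known here, so the code just returns whatever was parsed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint pattern copied from the comment thread; it may change without notice.
API_BASE = "https://sa-tb.nl/api/widget/overview"


def build_request(survey, company):
    """Prepare (but do not send) a GET request for one company's data."""
    query = urlencode({"sort": "", "group": "", "filter": ""})
    url = f"{API_BASE}/survey/{survey}/company/{company}?{query}"
    headers = {
        "Accept": "application/json",
        # ASSUMPTION: mimics the site's own page; possibly not needed at all.
        "Referer": "https://www.transparantiebenchmark.nl/",
    }
    return Request(url, headers=headers)


def fetch_company(survey, company):
    """Send the request and return the parsed JSON response."""
    with urlopen(build_request(survey, company), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_companies(survey, company_ids):
    """Fetch several companies in turn, keyed by company id."""
    return {cid: fetch_company(survey, cid) for cid in company_ids}
```

Calling `fetch_companies(4, [793])` would request one company. The other ids would have to be read off the site, for instance from the Request URLs shown in the Network tab.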

answered Feb 23 '18 at 21:51

phospodka

988
1
5
12

Thank you very much, the API-thing especially seems interesting. I'd need something like this https://sa-tb.nl/api/widget/overview/survey/4/company/793/category/67?sort=questionText-asc&group=&filter= , but although in the developer tabs under "Response" it seems to return something, when I manually enter that URL an error occurs. – I_love_Norway Feb 23 '18 at 22:46
@I_love_Norway looks like they are passing a header to tell the browser to only accept requests from their domain. It's the header `Access-Control-Allow-Origin:https://www.transparantiebenchmark.nl` If you use something like curl, wget, or anything else to just fetch the data you should be fine. E.g. `curl -XGET 'https://sa-tb.nl/api/widget/overview/survey/4/company/793?sort=&group=&filter='` – phospodka Feb 26 '18 at 16:16
Here is a [link](https://stackoverflow.com/a/10636765/3884529) explaining that `Access-Control-Allow-Origin` header. – phospodka Feb 26 '18 at 16:25
Thank you for the additional insights. Yesterday morning I got it working using python's requests module and actually pasting in the entire header-information there. But good to know which parameter in particular was important, and thanks to the link you provided I can now actually try to understand how these things work. – I_love_Norway Feb 27 '18 at 17:53
@I_love_Norway Welcome! It surprised me that header (or something similar) seems to come into play when hitting the URL directly in a browser. Curl at least worked without any extra header details for me, so it seems likely it is something the browser is handling. If this worked out, feel free to accept the answer. I'll still watch for any comments you have. – phospodka Feb 27 '18 at 19:50
