I usually scrape websites with BeautifulSoup, but this website is different: `soup.prettify()` returns mostly JavaScript code rather than the content I see in the browser. I want to scrape the data shown on the actual page (company name, telephone number, etc.). Is there a way to scrape scripts such as Main.js to retrieve the data that is displayed on the website?

Clearer version:

The page source contains:

<script src="/docs/Main.js" type="text/javascript" language="javascript"></script>

This script holds the text that appears on the website. I would like to scrape this text, but the page is populated by JavaScript rather than static HTML (which is what I normally use BeautifulSoup for).

Tom Pitts
  • Can you please be clearer about what you are trying to do? – Abhishake Gupta Sep 07 '16 at 18:39
  • Possible duplicate of [Web-scraping JavaScript page with Python](http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – Alexander O'Mara Sep 07 '16 at 18:39
  • @AlexanderO'Mara Sorry, updated it – Tom Pitts Sep 07 '16 at 18:42
  • Are you asking how to access the `/docs/Main.js` file? – Soviut Sep 07 '16 at 18:44
  • @Soviut I'm asking whether there is a way to access the information on the page. The Main.js file is stored on their server, but it populates the website with text. So how can I scrape that text? Usually I just search for the tag in the HTML and then print its content. – Tom Pitts Sep 07 '16 at 18:46
  • You can use a tool like `Selenium` to control a browser: open the page, let it run the JavaScript, and then get the data from the browser. Or you can manually analyze the communication between the browser and the server, because the JavaScript usually reads its data from the server, so you can find the URL that serves this data and read it directly (using BS if it is HTML). – furas Sep 07 '16 at 18:50
  • @furas How would I manually find the URL with data? Any tutorials you know of? Thanks :) – Tom Pitts Sep 07 '16 at 18:52
  • You will need to use a headless browser in order to run the JavaScript that generates the text. Normal HTTP requests won't do this. – Soviut Sep 07 '16 at 18:52
  • @Soviut So you would suggest running a headless browser and then analysing that with BeautifulSoup? – Tom Pitts Sep 07 '16 at 18:53
  • @TomPitts: BeautifulSoup only sees the HTML you download and hand to it. It can parse it and let you extract elements, but it completely ignores JavaScript code that can modify the page (and, in more modern webapps, entirely generate the page). In general, the easiest way would be to use a headless browser, since it will run the JavaScript just as your browser would. You can use it to render the page completely and then have BeautifulSoup parse the resulting HTML. The alternative would be to pick apart the JavaScript and figure out where it gets the data from. It's harder to do, but runs faster. – Blender Sep 07 '16 at 18:58
  • I use `Developer Tools` in Chrome or Firefox. There is a "Network" tab which shows all files/data sent from the server to the browser. JavaScript uses AJAX to get data, and you can filter for requests made by XHR (AJAX). This way you can find the URL with your data (mostly sent as JSON, so you can easily convert it to a Python dictionary). – furas Sep 07 '16 at 19:00
  • @Blender Thanks for your help, I'll go and try this. Am I correct in stating, though, that I can use a headless browser to render the website and then get BeautifulSoup to parse the result? – Tom Pitts Sep 07 '16 at 19:03
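
The Network-tab approach furas describes in the comments could look roughly like the sketch below. It assumes you have already found, in the browser's developer tools, an XHR request that returns the page's data as JSON. The URL `https://example.com/docs/data.json` and the keys `companies`, `name` and `phone` are hypothetical placeholders, not taken from the site in the question; the real endpoint and field names have to be read off the Network tab.

```python
import json
import urllib.request


def fetch_json(url):
    """Download a URL and decode its JSON body into Python objects."""
    # Some servers reject requests without a browser-like User-Agent.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        # Assumes the response is UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))


def extract_companies(payload):
    """Return (name, phone) pairs from the decoded JSON.

    The "companies", "name" and "phone" keys are hypothetical; replace
    them with whatever structure the real response uses.
    """
    return [
        (item.get("name"), item.get("phone"))
        for item in payload.get("companies", [])
    ]


if __name__ == "__main__":
    # Hypothetical URL: use the one shown in the Network tab instead.
    data = fetch_json("https://example.com/docs/data.json")
    for name, phone in extract_companies(data):
        print(name, phone)
```

If such an endpoint exists, this avoids running a browser at all, which is why Blender notes that this route is harder to set up but faster to run.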

1 Answer

You're asking if you can scrape text generated at runtime by JavaScript. The answer is: sort of.

You'd need to run some kind of headless browser, such as PhantomJS, to let the JavaScript execute and populate the page. You'd then feed the HTML that the headless browser generates to BeautifulSoup to parse it.
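
As a rough illustration of those two steps, here is a minimal sketch that uses Selenium driving headless Chrome in place of PhantomJS (a swapped-in choice, since PhantomJS is no longer maintained). It assumes Selenium, a Chrome installation and `beautifulsoup4` are available. The URL and the `.company-name` CSS selector are hypothetical; the real selector depends on the markup the site's JavaScript produces.

```python
from bs4 import BeautifulSoup


def render_page(url, wait_selector, timeout=10):
    """Load a page in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script has inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_text(html, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(selector)]


if __name__ == "__main__":
    # Hypothetical URL and selector, for illustration only.
    html = render_page("https://example.com/", ".company-name")
    print(extract_text(html, ".company-name"))
```

Waiting for a specific element before reading `page_source` matters because the script may populate the page some time after the initial load finishes.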

Soviut