I need to scrape data from a webpage that uses JavaScript encryption to protect its data. If I visit the page in my browser and view the page source, I can't see the data, but if I use "Inspect Element" in Firefox, or save the page to my computer and then examine its source, I can see the data I need, unencrypted.

At the moment I'm using:

import requests
source = requests.get(url).text

but I only receive the unparsed source code.

I've also tried to use wget:

import wget
source = wget.download(url)

but it downloads an "Access denied Cloudflare" page.

How can I access the parsed source code in Python?

Hyperion
  • They're not encrypted. The content you get from `requests.get()` is the actual source of that specific URL. The difference is that modern browsers execute the JavaScript in it and do as told (populate tables, load extra data, etc.) to render the final page for you. – Shane Feb 02 '17 at 10:05
  • @Shane Thank you for the clarification. Is it possible to obtain the parsed source code, then? – Hyperion Feb 02 '17 at 10:21
  • Yeah sure, you just need to find out how those "behind the scenes" requests work in an actual browser such as Chrome, and then simulate those requests. – Shane Feb 02 '17 at 10:29
  • @Shane, that's what a headless browser does. –  Feb 02 '17 at 10:32
  • @MartinBroadhurst: It's not just what a headless browser does, it's what **ALL modern browsers** do. – Shane Feb 02 '17 at 10:36
  • @Shane, exactly, so that's why Hyperion can use a headless browser to do what he wants. –  Feb 02 '17 at 10:46
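
The approach Shane describes, simulating the browser's background requests, might look roughly like this in Python. It is a minimal standard-library sketch, not a tested recipe for this particular site: the helper names `build_xhr_request` and `fetch_json` and all header values are hypothetical, and the real endpoint and headers have to be copied from the browser's Network tab. A site behind Cloudflare may still reject such requests.

```python
import json
import urllib.request


def build_xhr_request(api_url, referer):
    """Build a request that resembles the browser's background (XHR) call.

    The header values are illustrative assumptions; copy the real ones
    from the request shown in the browser's developer tools.
    """
    return urllib.request.Request(
        api_url,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Referer": referer,
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body it returns."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(build_xhr_request(api_url, page_url))` would then return the decoded data, assuming the endpoint really answers with JSON.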

1 Answer

If a page is rendered by JavaScript, you need to use a headless browser like PhantomJS to download it and access the document structure. A headless browser will run the JavaScript on the page and create the document by fetching external data, populating tables, etc., just like a real browser.

Here is an example of a PhantomJS program downloading a page and getting the document title:

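var page = require('webpage').create();
var url = require('system').args[1];  // URL passed as the first command-line argument
page.open(url, function(status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
        return;
    }
    // evaluate() runs the function inside the page and returns its result.
    var title = page.evaluate(function() {
        return document.title;
    });
    console.log('Page title is ' + title);
    phantom.exit();
});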
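
Since the question asks for Python, the same idea can be sketched there too. This swaps PhantomJS for Selenium driving headless Firefox, which is an assumption and not part of the answer above; it needs the third-party `selenium` package and geckodriver installed. The helper names `rendered_html` and `make_headless_firefox` and the `is_ready` predicate are hypothetical. The predicate is there because the dynamic content may arrive some time after the page itself has loaded.

```python
import time


def rendered_html(driver, url, is_ready, timeout=15.0, poll=0.5):
    """Load url in a browser driver and return the DOM after scripts ran.

    driver   -- any object with Selenium's WebDriver interface
                (only get() and page_source are used here)
    is_ready -- caller-supplied predicate; returns True once the
                dynamic content is present in the page
    If the timeout expires first, whatever has rendered so far is returned.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready(driver):
            break
        time.sleep(poll)
    # page_source serializes the current DOM, not the original response body.
    return driver.page_source


def make_headless_firefox():
    """Create a headless Firefox driver (needs selenium and geckodriver)."""
    from selenium import webdriver  # imported lazily: third-party dependency

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

A typical call would be `rendered_html(driver, url, lambda d: "some marker text" in d.page_source)`, where the marker is any string that only appears once the data has loaded, followed by `driver.quit()` when done.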
  • I've tried to download a webpage using the code provided in the first answer [here](http://stackoverflow.com/questions/16856036/save-html-output-of-page-after-execution-of-the-pages-javascript), but the source code of the downloaded page is still unparsed. How do I get the final code in any accessible form? – Hyperion Feb 02 '17 at 10:49
  • Did you write the page content in an `onLoadFinished` handler? –  Feb 02 '17 at 11:01
  • I just copied the code from the answer 1:1 and replaced the URL with mine. I'm not really familiar with JS. – Hyperion Feb 02 '17 at 11:02
  • Try rendering a PNG image at the same point in the code and see if it looks right. If it doesn't contain the dynamic content, then you're doing it at the wrong time in the process. –  Feb 02 '17 at 11:06
  • Just checked: my code uses "page.onLoadFinished = function() {", which should wait until the page is fully loaded. But I still get the non-rendered code... – Hyperion Feb 02 '17 at 11:30