8

I have looked around and only found solutions that render a URL to HTML. However, I need a way to render a webpage that I already have (and that contains JavaScript) to proper HTML.

Want: Webpage (with JavaScript) ---> HTML

Not: URL --> Webpage (with JavaScript) ---> HTML

I couldn't figure out how to make the other code work the way I wanted.

This is the code I was using that renders URLs: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

For clarity, the code above takes the URL of a webpage that has some parts rendered by JavaScript. If I scrape the page normally, say with urllib2, I won't get the links and other content that only appear after the JavaScript has run.

However, I want to be able to scrape a page, say again with urllib2, and then render that page and get the resulting HTML. (This differs from the code above, which takes a URL as its argument.)

Any help is appreciated, thanks guys :)

user3928006
  • I find what you want unclear. Perhaps you can give an example of what you mean by "render a webpage to proper HTML". Do you want the actual DOM? Do you want the textual HTML? Rendering can be done when you "feed the webpage into a browser" (i.e., open this text file with a browser), so it's not clear what else you want to achieve that is not already done by the browser. – barak manos Apr 02 '15 at 04:20
  • Now that you've made it clearer - I would go with Selenium Web Driver. Have you considered that? If you give a more concrete example of your `urllib2` code, then I might be able to refer to it with a corresponding Selenium code. – barak manos Apr 02 '15 at 04:36
  • Now it's completely unclear what it is that you want: "I want this part but in a way like the first example" - But the first example doesn't do any of that. It just says in a comment "I want to render text and get the pure HTML". So do you want to render the URL or not??? What difference does it make if you first fetch the data from the URL into a file using `urllib2`? In either case you have to send an HTTP request at some point. You can take the text file and feed it into Selenium (or any other scraping utility), but it's not going to be any different than using the URL directly. – barak manos Apr 02 '15 at 04:56
  • The URL is protected by Cloudflare, and if I fetch the URL directly I get the Cloudflare block page; I don't know how to fetch the bypassed URL. I do have a way to get the bypassed HTML, however – user3928006 Apr 02 '15 at 05:08
  • So you can fetch it **only** with `urllib2`? How is that possible??? – barak manos Apr 02 '15 at 05:16
  • I tried to give a simplified example of what I was trying to do, but that obviously failed. I fetch with cfscraper using the .get(URL) method, which acts like urllib's .get but bypasses the Cloudflare page – user3928006 Apr 02 '15 at 05:21
  • And then you want to render that data in a browser? – barak manos Apr 02 '15 at 05:24
  • The page has Cloudflare protection and some JavaScript-generated URLs. I want to fetch those URLs, so I need to bypass Cloudflare, then execute the JavaScript and get the resulting HTML so I can extract them. – user3928006 Apr 02 '15 at 05:29
  • Have you ever tried to open a local HTML file in a browser? The URL line looks something like (for example) "file:///C:/Users/Desktop/test.html". I bet that if you use that as the argument which you pass to `Render`, then it will get you the result you desire. If not, then I'm sure that Selenium **will** be able to handle it properly. The file type doesn't have to be `html` of course, just make sure that you pass the correct file path. – barak manos Apr 02 '15 at 05:30
  • So write it to an HTML file and pass that as a local HTML file to the code from the link? Or is that still overcomplicating? – user3928006 Apr 02 '15 at 05:34
  • Give it a try with `Render`. If it doesn't work, then give it a try with `Selenium` instead. – barak manos Apr 02 '15 at 05:35
  • Just gave it a try myself, Selenium works for sure. – barak manos Apr 02 '15 at 05:41
  • Could you provide a simple example? I just got home; I'll install Selenium now and give it a try. – user3928006 Apr 02 '15 at 05:57

3 Answers

13

Install Selenium from the command line with `pip install selenium`, then run something like:

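from selenium import webdriver
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request.urlopen

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'

# Fetch the raw page source (JavaScript not yet executed)
conn = urlopen(url)
data = conn.read()
conn.close()

# Save it to a local file; binary mode writes the bytes unchanged
with open(file_name, 'wb') as f:
    f.write(data)

# Open the local file in Firefox so its JavaScript runs, then read the resulting DOM
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()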
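
If the HTML is already in memory (for example, fetched through a Cloudflare-bypassing client as described in the question), the download step can be skipped. The following is only a sketch of that variant, assuming Python 3 and an installed Selenium with Firefox available; the helper names `write_html_to_temp_file` and `render_local_html` are made up for illustration and are not part of Selenium.

```python
import tempfile
from pathlib import Path


def write_html_to_temp_file(html_text):
    """Save already-fetched HTML to a temporary .html file and return its path."""
    # delete=False keeps the file on disk after the handle is closed so a
    # browser can open it; the caller is responsible for deleting it later.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(html_text)
    return Path(handle.name)


def render_local_html(path):
    """Open a local HTML file in Firefox and return the DOM after scripts ran."""
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver

    browser = webdriver.Firefox()  # requires a local Firefox installation
    try:
        # Path.as_uri() builds a correct file:// URL on both Windows and POSIX.
        browser.get(path.as_uri())
        return browser.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        browser.quit()
```

Usage would look like `html = render_local_html(write_html_to_temp_file(fetched_html))`, where `fetched_html` is the string you already have. One caveat: relative URLs in the saved page resolve against the `file://` location, so scripts referenced by relative paths may not load.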
barak manos
  • I hit another problem however, is there somewhere more convenient I could ask you about it? – user3928006 Apr 02 '15 at 07:25
  • @user3928006: Post it in another question. You'll be asking not just me, but the entire community (so you'll have better chances of getting a good answer). You can link it in a comment to this question if you want my specific attention to it at that point. – barak manos Apr 02 '15 at 07:27
  • It's quite relevant to this question, something in the rendered page isn't rendering how I would expect, I'll update this question with my edited version of your code – user3928006 Apr 02 '15 at 07:29
  • @user3928006: No, don't do it this way, it will make the answer obsolete and partially irrelevant. This is not how things are usually done here. If your new problem is related to this question (or to the answer), then link it **within the new question that you post**. – barak manos Apr 02 '15 at 07:31
  • Oh, whoops, I edited it already :/ (Thanks for the tip for the future, though...) – user3928006 Apr 02 '15 at 07:36
  • @user3928006: Yes well, you just took my answer and copied into your question (without even specifying it, BTW). It ruins the entire point of 'question & answer' in this post. Please publish it as a separate question! – barak manos Apr 02 '15 at 07:43
  • Not so simple. This requires having both the `Firefox` browser and the `geckodriver` installed. – Leonid Jun 10 '20 at 16:56
4

The module I use for this is requests_html. The first time you render a page, it automatically downloads a Chromium browser; after that you can render any webpage (with JavaScript).

requests_html also supports HTML parsing.

It is basically an alternative to Selenium.

Example:

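from requests_html import HTMLSession

URL = 'http://www.google.com'  # placeholder; substitute the page you want to render

session = HTMLSession()
r = session.get(URL)

# Run the page's JavaScript in headless Chromium;
# you can use r.html.render(sleep=1) if you want to wait after loading
r.html.render()

html = r.html.html  # the HTML after rendering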
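
Since the question is about HTML that has already been fetched, requests_html can also render markup held in a string rather than a URL. This is a rough sketch, assuming Python 3 with requests_html installed; `render_fetched_html` is a hypothetical helper name, and it relies on the library's `HTML` class and the `reload=False` option of `render()`.

```python
def render_fetched_html(html_text, sleep=0):
    """Execute the JavaScript in already-fetched HTML and return the new markup."""
    # Imported lazily: requests_html downloads Chromium on the first render.
    from requests_html import HTML

    page = HTML(html=html_text)
    # reload=False renders the markup held in memory instead of requesting
    # the page's URL again; sleep waits that many seconds after loading.
    page.render(reload=False, sleep=sleep)
    return page.html
```

Calling `render_fetched_html(fetched_html)` should return the markup after scripts have run. Because no real URL is involved, scripts that depend on the page's original address may behave differently than they would in a normal browser visit.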


-1

Try `webdriver.Firefox().get('url')`.