Url request does not parse every information in HTML using Python

Question

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:

data = requests.get(url,time.sleep(15)).text

I used time.sleep because the website does not connect directly to the exchange's main page, but I am not sure it is necessary.

The thing is that I cannot find anything inside the <body style> element in the HTML text (which is the data variable in this case). How can I get the full HTML and then extract the price information from this website?

I know Python, but I am not very familiar with websites/HTML, so I would appreciate it if you could explain the website-related parts as if you were talking to a beginner. Thanks!

score 0 · Answer 1 · answered Dec 11 '20 at 02:33

There could be a few reasons for this.

From what I can tell, the website runs behind a proxy server, which adds to the loading time of your request. That would explain why you are not taken directly to the main page.
It might also be that the elements are rendered by JavaScript after the page has loaded. requests only downloads the initial HTML and does not run JavaScript, so you get the page without the JavaScript-rendered parts. Increasing the sleep() time will not help: in requests.get(url, time.sleep(15)) the sleep finishes before the request is even sent, and its return value (None) is simply passed as the params argument.
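
To check whether the second reason applies, you can look at how much visible text the downloaded HTML actually contains inside <body>. The following is a minimal sketch using only the standard library; the class name, the helper function, and both sample HTML strings are made up for illustration and are not taken from chiliz.net.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collect visible text inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is inside <body> and outside scripts/styles.
        if self.in_body and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_body_text(html):
    """Return the visible text found inside <body> as one string."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical HTML as a server might send it before any JavaScript has run:
# an empty container plus a script tag that fills it in later.
static_html = """<html><head><title>Demo</title></head>
<body style="margin:0"><div id="app"></div><script src="/bundle.js"></script></body></html>"""

# Hypothetical HTML after a browser has run the script.
rendered_html = """<html><body><div id="app"><span class="price">0.0213</span></div></body></html>"""

print(repr(visible_body_text(static_html)))    # ''
print(repr(visible_body_text(rendered_html)))  # '0.0213'
```

If visible_body_text(requests.get(url).text) comes back empty or nearly empty for your URL, that suggests the content is filled in by JavaScript and a plain HTTP request will not be enough.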

Instead, you can use a library called Selenium. It automates a real browser, so the page's JavaScript runs, and you can use the page_source property to obtain the resulting HTML source code.

Code (taken from here)

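from selenium import webdriver

# Start a real Firefox instance (Firefox must be installed)
browser = webdriver.Firefox()
browser.get("http://example.com")

# HTML as the browser currently sees it, including JavaScript-rendered parts
html_source = browser.page_source

browser.quit()  # close the browser when done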

With Selenium, you can also locate the element that holds the price with an XPath expression, which covers the "extract the price information from this website" part of your question; you can see a tutorial on that here. Alternatively, once you have the HTML source, you can use a parser such as bs4 (Beautiful Soup) to extract the required data.
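
As a rough sketch of the XPath approach, the code below waits for an element to become visible and then converts its text to a number. The function names, the default timeout, and any XPath you pass in are assumptions for illustration; you would need to find the real XPath for the price element with your browser's developer tools. The Selenium imports sit inside fetch_price so that parse_price can be tried on its own without Selenium installed.

```python
import re


def parse_price(text):
    """Return the first number found in text as a float, e.g. '$1,234.56' -> 1234.56."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group().replace(",", ""))


def fetch_price(url, price_xpath, timeout=20):
    """Open url in Firefox, wait for the element at price_xpath, return its price."""
    # Imported here so parse_price works even if Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # Wait until the element is visible instead of sleeping for a fixed time.
        element = WebDriverWait(browser, timeout).until(
            EC.visibility_of_element_located((By.XPATH, price_xpath))
        )
        return parse_price(element.text)
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()
```

A call might look like fetch_price("http://example.com", '//span[@class="price"]'), where the XPath is a placeholder. If the element becomes visible before its text is filled in, parse_price raises ValueError, so you may need a more specific wait condition for a given site.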
