I am trying to collect data from a web page that displays search results for cars on sale.

The structure of the page is not too complex, and I was able to single out the interesting bits of information using the `data-testid` attribute, which every returned car record carries.

I can locate the individual fields, such as price, registration year, mileage and so on, by matching substrings of this attribute's value.

I use beautifulsoup to parse the HTML and requests to initially load the HTML document from the web.

Now, here's the issue. The HTML returned by requests.get() is incomplete, in a way I can neither predict nor explain. For a page of 100 results, inspecting the live page shows 100 data-testid fields with the price substring, 100 with the mileage substring, and so on. The HTML returned by requests.get(), like the one I get from a 'save as' operation on the page itself, contains only a portion of these fields.

The number of fields that do come back also varies unpredictably from one search to the next.
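To put numbers on the discrepancy, the count of matching attributes can be computed for any HTML string (the text from requests.get(), or a saved file read from disk). The following is only a minimal sketch: it uses the standard library's html.parser rather than beautifulsoup to stay dependency-free, and the sample markup and the substrings "price" and "mileage" are made-up stand-ins for the site's real attribute values.

```python
from html.parser import HTMLParser


class DataTestIdCounter(HTMLParser):
    """Count start tags whose data-testid value contains each given substring."""

    def __init__(self, needles):
        super().__init__()
        self.counts = {needle: 0 for needle in needles}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        testid = dict(attrs).get("data-testid") or ""
        for needle in self.counts:
            if needle in testid:
                self.counts[needle] += 1


def count_testids(html, needles):
    parser = DataTestIdCounter(needles)
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical markup: the real attribute values on the site will differ.
sample = """
<div data-testid="result-1-price">10 000</div>
<div data-testid="result-1-mileage">80 000 km</div>
<div data-testid="result-2-price">12 500</div>
<br data-testid="unrelated"/>
"""
counts = count_testids(sample, ["price", "mileage"])
print(counts)
```

With beautifulsoup, the same count for one substring should be obtainable as `len(soup.select('[data-testid*="price"]'))`.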

I started by asking why the online and the saved HTML differ. There is no full answer so far, but a hint in the comments was that the page loads its content dynamically through JavaScript. I was happy to find that saving the page to disk, with all its files, somehow produced the full HTML, which I could then parse without further issues.

However, my joy only lasted for that specific search. When I tried a new one, I was back to square one. Further investigation led me to my current understanding, which is the origin of this question: when I save the online page as 'Webpage, Complete' (which creates an .html file plus a folder), the combination definitely contains ALL records. I can say that because if I go offline and double-click the newly saved .html file, I can see all the records that were online (100 in this case). However, the HTML file itself contains only a few of them.

My deduction is, therefore, that the rest of the records must be 'hidden' in the folder created at saving time, and I would tend to say it could be embedded in those (many) *.js.download files:

[image: screenshot of the saved folder]

My questions are:

  • Is my assumption correct? Are the other records stored in those files?
  • If yes, how can I make them 'explicit' when parsing the HTML with beautifulsoup?
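One way to test the first assumption is to take a value that is visible in the offline page but absent from the .html file (a specific price, say) and search the saved folder for it. The sketch below assumes only the *.js.download naming mentioned above. The function name, the demo folder, its file names and the needle "12500" are all made up for illustration. An empty result would not prove much either way, since the value could be stored in a different format.

```python
from pathlib import Path
import tempfile


def files_containing(folder, needle, pattern="*.js.download"):
    """Return the names of files under folder matching pattern whose text contains needle."""
    hits = []
    for path in sorted(Path(folder).rglob(pattern)):
        # Ignore undecodable bytes: minified bundles are not always clean UTF-8.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text:
            hits.append(path.name)
    return hits


# Demo on a throwaway folder standing in for the folder created by 'Webpage, Complete'.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "bundle.js.download").write_text("function render(){}", encoding="utf-8")
    (folder / "data.js.download").write_text('{"price": "12500"}', encoding="utf-8")
    demo_hits = files_containing(folder, "12500")

print(demo_hits)
```

On the real data, the call would look like `files_containing("path/to/saved_folder", "some value seen on the page")`, where both arguments are placeholders.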

UPDATE 07/05

I've installed and tried requests_html, as suggested in the comments and in a linked answer. Its render() method looked promising, but I probably don't really understand the mechanism explained in the requests_html documentation (the section on rendering JavaScript), because the following operations (pseudo-code) did not give me the full page:

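from requests_html import HTMLSession

session = HTMLSession()
r = session.get(URL)  # URL is the address of the search-results page
r.html.render()  # re-loads the page in headless Chromium and runs its JavaScript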

At this point, I was hoping to have 'forced' the site to 'spit out' ALL the HTML, including the bits that remain somehow hidden and only show up in the live page. However, a subsequent dump of r.html.html to a file still gives back the same 5 records (why 5 now, when other searches returned 12 or even 60, is a complete mystery to me).
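One thing that might be worth trying is giving the page more time, and some scrolling, before reading the rendered HTML. render() accepts sleep, scrolldown and timeout arguments for this. The sketch below is untested against the actual site: the function names, the default values and the "price" substring are my own guesses, and there is no guarantee that this brings back all 100 records.

```python
import re


def count_testid_substring(html, needle):
    """Count data-testid attribute values in html that contain needle."""
    values = re.findall(r'data-testid="([^"]*)"', html)
    return sum(needle in value for value in values)


def fetch_rendered_html(url, sleep=3, scrolldown=5, timeout=30):
    """Fetch url and return the HTML after requests_html has executed its JavaScript."""
    # Imported here so the counting helper above works without requests_html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    # scrolldown: how many times to page down; sleep: seconds to pause after
    # rendering (and after each scroll); timeout: seconds before rendering gives up.
    # The values used here are guesses, not recommendations.
    response.html.render(sleep=sleep, scrolldown=scrolldown, timeout=timeout)
    return response.html.html


# Intended use (needs network access and the Chromium download):
# html = fetch_rendered_html(URL)
# print(count_testid_substring(html, "price"))
```

If the count stays well below 100 even with generous values, rendering time is probably not the limiting factor.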

Michele Ancis
  • Does this answer your question? [How to call JavaScript function using BeautifulSoup and Python](https://stackoverflow.com/questions/48603339/how-to-call-javascript-function-using-beautifulsoup-and-python) – Chillie Jul 05 '21 at 09:27
  • I'll be able to tell you once I understand what a 'headless browser' is and how to use it in this context :) Answer number 4, using `requests-HTML` instead of `requests`, also gives some hope. Thanks for the pointer! – Michele Ancis Jul 05 '21 at 18:08

0 Answers