
My goal is to extract information from a search result on a web site in a more readable format, and I am very puzzled by what I am seeing.

I access the result page via the 'inspect' feature of Chrome:

Chrome RMB Menu

This gives a split pane where every element in the page rendering is reachable as its HTML counterpart:

Split Pane

Now, I am interested in parsing specific tags with an attribute that has an "entry-price" substring in it.

Entry Price

As you can see, every car record found has a price <span> element with the price info embedded in it. I am using the price as the example; the situation is very similar for the other properties of every record returned by the search.

This specific page has 86 results, and the <span> elements with that specific data-testid attribute value are also 86, at least in this view:

86 tags

The 'interesting' thing is that when I saved the HTML of the page, I saw far fewer tags with those characteristics: only 5, in fact. To reduce the margin for error, I then used Chrome's 'view source' function on the page.

There, to my great surprise, a simple text search on 'entry-price' only returns 5 items!!

Only 5 items

Here's the full link, if you want to try for yourself: https://www.willhaben.at/iad/gebrauchtwagen/auto/gebrauchtwagenboerse?ENGINE/FUEL=100001&WHEEL_DRIVE=3&EQUIPMENT=11&sort=3&CAR_MODEL/MAKE=1042&sfId=e7ce8b54-db41-419b-a7e2-edd5f23501eb&isNavigation=true&CAR_MODEL/MODEL=1774&rows=100&page=2&YEAR_MODEL_FROM=2018&YEAR_MODEL_TO=2021

(it's actually the 2nd page of 186 total results, at 100 results per page)

How is that possible? I cannot understand it at all.

BTW, the reason I looked at the page source is that I had a small Python script in place, using BeautifulSoup, to parse the saved HTML and extract what I needed. It worked just fine with another search, but this one is giving me extra headaches.
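To make the count reproducible, here is a minimal sketch of the kind of check described above: counting the <span> tags whose data-testid attribute contains "entry-price". It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs. The sample markup, the attribute values in it, and the names `TestIdCounter` and `count_price_spans` are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class TestIdCounter(HTMLParser):
    """Count <span> tags whose data-testid attribute contains a substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "data-testid" and value and self.needle in value:
                self.count += 1


def count_price_spans(html, needle="entry-price"):
    """Return how many matching <span> tags appear in the given HTML text."""
    parser = TestIdCounter(needle)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample standing in for the saved page (not the real markup).
sample = """
<div><span data-testid="result-entry-price-1">EUR 30.000</span></div>
<div><span data-testid="result-entry-price-2">EUR 28.500</span></div>
<div><span data-testid="result-entry-header-2">Some title</span></div>
"""

print(count_price_spans(sample))
```

Running the same function on the 'view source' HTML and on the HTML copied from the 'inspect' pane should show the mismatch (5 versus 86) directly.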

Michele Ancis
  • It might be because some contents are loaded with JavaScript and take longer to load, so when your Python web scraper loads the page, those contents are not there yet. But when you visit the page, JavaScript has the time it needs to fetch all the data – Gogu Gogugogu Jul 03 '21 at 20:45
  • I wasn't clear in my explanation. On one hand I have the 'inspect' view of the page, which is fast to interact with and has zero issues counting the elements properly; on the other, a 'view source' view which is slow, cumbersome and does not count the elements properly. Also, a simple 'save as' of the page should give the full info contained on the page, but it does not. There, too, the records are missing. – Michele Ancis Jul 03 '21 at 20:51
  • 'View source', or saving the page locally, only shows the source code as delivered by the server. Think of it as a blank page. When you load the page in your browser, JavaScript populates the blank page with the 86 cars. But this is only visible through 'inspect', since it shows the current HTML of the page rather than the source code – Gogu Gogugogu Jul 03 '21 at 21:04
  • Besides the fact that I don't understand how this could work for the previous page, with 100 elements in it and no problems, and for another search with ~65 elements, also no problems, my question would then be: how do I save to disk, or request, the FULL set of data? (I've tried the `requests.get()` method in Python and I get the same result with only 5 records.) Meaning, the page that I actually see on screen as a search result? When I visit the page, I always get the full set. Why don't I get that when I visit through a `requests.get()` call? – Michele Ancis Jul 03 '21 at 21:12
  • What I usually do is open the page in a browser and open the Network tab (next to Console); there you should be able to identify the URL used to fetch the cars data. Then you should make your request to that same URL. If necessary, set some request headers, like Origin and Referer, to make the request look like it's coming from that website – Gogu Gogugogu Jul 03 '21 at 21:25
  • OK... I've made a little step forward: if I save the HTML file as 'Webpage, Complete' instead of 'Webpage, HTML only', then the saved file contains all the records displayed online. With this, at least I can parse the saved file! It remains a mystery to me why the `requests.get()` method doesn't yield the complete file... – Michele Ancis Jul 03 '21 at 21:25
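
The suggestion in the comments (find the data URL in the Network tab, then request it with browser-like headers) could be sketched roughly as follows. This is only an illustration under assumptions: the data URL below is a placeholder, since the real endpoint has to be read from the Network tab, and the helper names `build_request` and `fetch` are hypothetical. It uses the standard library's `urllib` instead of `requests` to avoid a third-party dependency; the same headers can be passed to `requests.get(url, headers=...)`.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def build_request(data_url, page_url):
    """Build a GET request for data_url that carries browser-like headers.

    data_url: the URL seen in the Network tab (placeholder in this sketch).
    page_url: the search page the request should appear to come from.
    """
    parts = urlsplit(page_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    headers = {
        "User-Agent": "Mozilla/5.0",  # many sites reject the default agent
        "Referer": page_url,          # HTTP spells this header "Referer"
        "Origin": origin,
    }
    return Request(data_url, headers=headers)


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder URLs: replace both with the ones observed in the browser.
request = build_request(
    "https://example.com/api/search?page=2",
    "https://example.com/search",
)
print(request.get_header("Referer"))
# body = fetch(request)  # uncomment once the real data URL is known
```

Whether the extra headers are needed at all depends on the site; the key step is requesting the URL that actually delivers the records rather than the initial HTML page.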
