
I'm writing a scraper. When I use Inspect Element in Chrome, I see the following:

[Screenshot: code visible in Chrome's Inspect Element]

But when I run my code, `Elements data = doc.select("div.item-header");`, and print the `data` object, it contains the following chunk of HTML:

<div class="item-header"> 
 <h1 class="text size-20">Snake print bell sleeves top</h1> 
 <div class="text size-12 muted brandname ma_top5">
       <!-- data here is irrelevant --> 
 </div> 
</div>

What I can't figure out is why my code gets different HTML from what is visible in Chrome's Inspect Element. What am I missing here?

I'm using Java with the Jsoup library. Any help is greatly appreciated.

Talha
  • 2
    Is it possible that the HTML you are trying to scrape has more than one `div` with the class `item-header`? – dave Jan 25 '20 at 22:47
  • 2
    It's also possible that the data is loaded after the DOM is loaded (ajax?). – PRiM Jan 25 '20 at 22:48
  • It may be possible that your site has code that modifies the HTML while the Chrome console is open, hence you are getting a different result: https://stackoverflow.com/questions/7798748/find-out-whether-chrome-console-is-open – divyang4481 Jan 25 '20 at 22:48
  • FYI it's __scraper__ (and __scraping__, __scraped__, __scrape__) not scrapper. A scrapper is someone who throws things away, i.e. pretty much the opposite of what you want to do. – DisappointedByUnaccountableMod May 05 '21 at 08:21

1 Answer


Websites consist of HTML and JavaScript code. That JavaScript is often executed when the page loads, and it may modify the page's markup or load additional content through asynchronous AJAX calls. Jsoup doesn't execute JavaScript, so it only sees the original HTML document.
Don't rely on Chrome's Inspect option, because it shows the HTML after any such transformations. Use View source (Ctrl+U) instead. That shows the original HTML source, unmodified by JavaScript (you can also try reloading the page with JavaScript disabled), and that original source is what Jsoup downloads and parses.
If that's the case and you really want to parse the data that's loaded by JavaScript, try observing the XHR requests in Chrome's Network tab. See this answer for what I mean: How to Load Entire Contents of HTML - Jsoup
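
To run the same "View source" check from Java, here is a minimal sketch. It deliberately uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) instead of Jsoup, so it runs without extra dependencies. The class name `RawSourceCheck`, its method names, and the `User-Agent` value are made up for illustration. It downloads the response body without running any JavaScript, which should be close to what Jsoup receives (servers can vary their response by request headers), and reports whether a snippet you saw in Inspect Element is present.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; not part of Jsoup or of the original question.
class RawSourceCheck {

    // Downloads the response body as sent by the server. No JavaScript is executed.
    public static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0") // assumed value; some sites reject unknown clients
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // True if the marker text is already in the server-sent HTML,
    // i.e. it was not added later by JavaScript.
    public static boolean appearsInRawSource(String rawHtml, String marker) {
        return rawHtml.contains(marker);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java RawSourceCheck.java <url> <text seen in Inspect Element>");
            return;
        }
        String rawHtml = fetchRawHtml(args[0]);
        if (appearsInRawSource(rawHtml, args[1])) {
            System.out.println("Found in raw source: a plain HTML parser should see it.");
        } else {
            System.out.println("Not in raw source: probably added by JavaScript; check the XHR requests.");
        }
    }
}
```

If the text is missing from the raw source, the Network tab approach described above is the next thing to try.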

Krystian G