I want to parse some Google Play web pages (for example, this one) to get a game's current version, total downloads, and so on. I'm not new to Java, but I am fairly new to parsing. I heard about the jsoup library and tried it, but ran into a problem.

It seems Google Play doesn't serve a complete HTML document (the page source looks rather bare). I think the page loads first, and only then is the data loaded onto it with JavaScript. The div/span classes all have the same names, so I get something like this:

<span class="htlgb">December 16, 2019</span>
<span class="htlgb">20M</span>
<span class="htlgb">100,000+</span>
<span class="htlgb">1.5.7</span>
<span class="htlgb">4.0 and up</span>

How can I get around this? Any tips? Can I solve it with jsoup or not?

  • If the page requires JS to load the information you're interested in, then an easier way would probably be to use a web driver like Selenium. Related: [Jsoup Java HTML parser : Executing Javascript events](https://stackoverflow.com/q/7344258) – Pshemo Dec 20 '19 at 18:32

1 Answer

You'll just have to keep your parser up to date with the site. For now, you'll have to assume that the first span with that class name is the date, the second is the size, the third is the installs, and so on. You can get a list of the span elements with the class htlgb and identify them by index.

However, if you make some additional assumptions, you can be more certain. For example, you can tell which span is the date because its text will include a month name (e.g. December).
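A minimal sketch of that idea, using only the standard library so it runs without jsoup: it takes the span texts as a plain list (which you could presumably collect with something like jsoup's `doc.select("span.htlgb").eachText()`) and labels each one by its content rather than its position. The class name `PlayFieldClassifier`, the field labels, and all of the patterns are my own assumptions, guessed from the five sample values in the question, so expect to adjust them.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: labels span texts by what they look like, not by index.
final class PlayFieldClassifier {
    // Assumption: dates are English, e.g. "December 16, 2019".
    private static final Pattern DATE = Pattern.compile(
        "(?:January|February|March|April|May|June|July|August|September"
        + "|October|November|December) \\d{1,2}, \\d{4}");
    // Assumption: installs look like "100,000+".
    private static final Pattern INSTALLS = Pattern.compile("[\\d,]+\\+");
    // Assumption: minimum Android version looks like "4.0 and up".
    private static final Pattern MIN_ANDROID = Pattern.compile("\\d+(?:\\.\\d+)* and up");
    // Assumption: size is a number followed by a unit letter, e.g. "20M".
    private static final Pattern SIZE = Pattern.compile("\\d+(?:\\.\\d+)?[kMG]");
    // Assumption: version is dotted digits, e.g. "1.5.7".
    private static final Pattern VERSION = Pattern.compile("\\d+(?:\\.\\d+)+");

    static Map<String, String> classify(List<String> spanTexts) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String raw : spanTexts) {
            String text = raw.trim();
            // matches() requires the whole text to fit the pattern,
            // so "4.0 and up" is not mistaken for a version.
            if (DATE.matcher(text).matches()) {
                fields.putIfAbsent("updated", text);
            } else if (INSTALLS.matcher(text).matches()) {
                fields.putIfAbsent("installs", text);
            } else if (MIN_ANDROID.matcher(text).matches()) {
                fields.putIfAbsent("minAndroid", text);
            } else if (SIZE.matcher(text).matches()) {
                fields.putIfAbsent("size", text);
            } else if (VERSION.matcher(text).matches()) {
                fields.putIfAbsent("version", text);
            }
            // Anything unrecognised is skipped rather than guessed at.
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "December 16, 2019", "20M", "100,000+", "1.5.7", "4.0 and up");
        System.out.println(classify(sample));
    }
}
```

Because each value is recognised by its shape, the result should not change if the site reorders the spans, although it will still break if the text formats themselves change.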

keyhan
  • Okay, that seems fine for one specific page. But these spans are inside a div block, and on each game's page the div block has a different name while the spans all share one name. How can I automate the parsing? – Miroha Dec 20 '19 at 19:34
  • @Miroha Oh, that is rough. I suggest finding the span that contains a month, checking that it matches the date format (e.g. Month ##, 20##), then grabbing that span's class name. That will be the class name used for the rest of the spans. – keyhan Dec 20 '19 at 19:52
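
The suggestion in the last comment could be sketched roughly as follows. This is only an illustration: it runs a regular expression over the raw HTML string as a stand-in for a proper DOM traversal (with jsoup you would more likely loop over the span elements and read each one's class attribute), and it assumes the date span carries nothing but a single `class` attribute. The name `SpanClassFinder` is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the class name of the span that holds the date.
final class SpanClassFinder {
    // Assumption: the span looks like <span class="...">Month ##, ####</span>
    // with an English month name and no other attributes.
    private static final Pattern DATE_SPAN = Pattern.compile(
        "<span class=\"([^\"]+)\">\\s*"
        + "(?:January|February|March|April|May|June|July|August|September"
        + "|October|November|December) \\d{1,2}, \\d{4}"
        + "\\s*</span>");

    // Returns the class of the first date-like span, or empty if none is found.
    static Optional<String> findDetailClass(String html) {
        Matcher m = DATE_SPAN.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div class=\"abc\">"
            + "<span class=\"htlgb\">December 16, 2019</span>"
            + "<span class=\"htlgb\">20M</span></div>";
        System.out.println(findDetailClass(html).orElse("(no date span found)"));
    }
}
```

The class name it returns can then be used to select the remaining spans on that page, so nothing has to be hard-coded per game.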