For some reason, when I parse https://touch.facebook.com/messages and call getElementById("threadlist_rows"), I get null even though the id definitely exists.

The id appears in the html() output, the same lookup works on Try jsoup online, and I can get other ids such as root without issues.

The same method has been working for other pages, and the only difference I can see is that there are a lot of other elements with ids that have a prefix of threadlist_row_ within the body.

Various other selectors such as getElementsByClass("aclb") also don't seem to be returning a full list for this page.

Can anyone shed any light on this issue?

Allan W
  • It won't work with Jsoup because Jsoup parses only static HTML. The element with the ID you mentioned is created dynamically (by JavaScript), and Jsoup does not execute any JavaScript code. Check out PhantomJS (https://stackoverflow.com/a/39369662/2194470) – Szymon Stepniak Jul 14 '17 at 06:44
  • @SzymonStepniak but shouldn't it work if the html method does indeed show that the id is already there? The element isn't being added after loading. I'll have to check for the aclb elements because that might be a different case. – Allan W Jul 14 '17 at 06:55
  • No, this element is not present when Jsoup opens the website. It shows up in the web browser because the browser interprets the JavaScript code that creates this element. Jsoup does not interpret any JavaScript, so this element never shows up for Jsoup. If you need a solution that interprets JavaScript, you will have to take a closer look at tools such as PhantomJS – Szymon Stepniak Jul 14 '17 at 07:03
  • @SzymonStepniak thanks. Just disabled javascript on my chrome and saw that this is the case. If you'd like to post an answer I'll accept it – Allan W Jul 14 '17 at 07:08
  • Done. Take a look at what I've found - combining PhantomJS with Jsoup. This might be the best way to do it in your case. Good luck! – Szymon Stepniak Jul 14 '17 at 08:13

1 Answer

In your case Jsoup won't work as you expect. The element with id threadlist_rows is rendered by a JavaScript function after the page loads. Jsoup does not interpret any JavaScript code; it works only with static HTML. You can approximate what Jsoup sees by viewing the page source (Ctrl + U) or by temporarily turning off JavaScript in your web browser.
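One way an id can show up in the raw html() text and still not exist as an element is when it only appears inside a script body or an HTML comment. The sketch below is a rough, regex-based check for that situation using only the Java standard library. It is not a real HTML parser and not part of Jsoup; the class name StaticIdCheck, the method hasStaticId, and the sample markup are all made up for illustration.

```java
import java.util.regex.Pattern;

final class StaticIdCheck {
    // Matches <script>...</script> blocks and HTML comments, across lines.
    private static final Pattern NON_ELEMENT_TEXT = Pattern.compile(
            "<script\\b[^>]*>.*?</script>|<!--.*?-->",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns true if an id="..." attribute with the given value appears
    // in the markup once script bodies and comments have been removed.
    // Heuristic only: it does not understand nesting or escaping.
    static boolean hasStaticId(String html, String id) {
        String markupOnly = NON_ELEMENT_TEXT.matcher(html).replaceAll("");
        Pattern idAttr = Pattern.compile(
                "\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"']");
        return idAttr.matcher(markupOnly).find();
    }

    public static void main(String[] args) {
        // Hypothetical page: "root" is static, the other id exists only in script text.
        String html = "<html><body><div id=\"root\"></div>"
                + "<script>document.body.innerHTML += "
                + "'<div id=\"threadlist_rows\"></div>';</script>"
                + "</body></html>";
        System.out.println(hasStaticId(html, "root"));            // true
        System.out.println(hasStaticId(html, "threadlist_rows")); // false
    }
}
```

If the check returns false for an id that a plain text search does find, the element is most likely being created in the browser, which a static parser will never see.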

Consider using alternatives, like:

  • PhantomJS (http://phantomjs.org/), a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

You can also try the approach described in this answer: https://stackoverflow.com/a/39174441/2194470 (downloading the rendered HTML with PhantomJS and manipulating the DOM structure with Jsoup). This could be the best option for you. Hope it helps :)
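The Java side of that approach could look roughly like the sketch below: start an external renderer, capture whatever it prints to standard output, and hand the resulting string to Jsoup. This is only a sketch under assumptions. The class RenderedHtmlFetcher and the method captureStdout are hypothetical names, and it assumes you have your own PhantomJS script that loads a URL and prints the rendered page content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class RenderedHtmlFetcher {
    // Runs the given command and returns everything it wrote to stdout.
    // The command is supplied by the caller; nothing here is PhantomJS-specific.
    static String captureStdout(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // show renderer errors on our stderr
                    .start();
            String output = new String(
                    process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("Renderer exited with code " + exitCode);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for renderer", e);
        }
    }
}
```

A call might look like captureStdout(List.of("phantomjs", "print_page.js", url)), where print_page.js is a placeholder for your own script. The returned string can then be passed to Jsoup.parse(html), after which getElementById("threadlist_rows") operates on the rendered DOM instead of the static source.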

Szymon Stepniak
  • Great. For those with the same issue: because of Android's WebView optimizations, I've decided to go with a WebView, even though it isn't headless. One can also override resource loading to skip all images and CSS and improve loading time, as those aren't necessary for an invisible view. The JS will be handled with watchers injected into the WebView. – Allan W Jul 14 '17 at 19:27
  • Awesome, Allan! Good luck with your project :) – Szymon Stepniak Jul 15 '17 at 09:49