
I am trying to get at HTML data that does not appear in the source document but can be exposed, for example, by "inspect element" in Google Chrome.

Example page: http://assignment.uspto.gov/#/search?q=9000000&sort=patAssignorEarliestExDate%20desc%2C%20id%20desc&synonyms=false

There are a number of div elements containing assignment data for U.S. Patent No. 9,000,000 that appear below the line

<script async="async" type="text/javascript" src="https://components.uspto.gov/js/ais/2-2-assignment-search.js"></script>

Is there a way to extract this hidden HTML with Jsoup?

PatentWookiee
  • 1
    I think this is possible with Selenium. Jsoup does not support JavaScript. – Yassin Hajaj Nov 24 '15 at 18:23
  • 1
    If an AJAX call fetches the data, then the data is exposed through an HTTP or REST API. You can use a plain HTTP call or Apache HttpClient to get it; there is no need to process it with Jsoup. – hutingung Nov 25 '15 at 01:02

2 Answers


The data seems to be loaded with AJAX. Jsoup does not execute JavaScript, so it only sees the HTML the server sends initially.

What you need is a "headless browser" API, which processes JavaScript without actually rendering anything.

HtmlUnit seems to be the best known tool, although I've never used it myself. As suggested before, Selenium Webdriver is also an option.

I believe you will have to load the URL and wait for all the AJAX requests to finish. You should then end up with almost the same parse tree in Java that you see in Chrome, to do with as you wish!
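The "wait for the AJAX" step might look something like the sketch below. It is only an illustration that uses the Java standard library. The `Supplier<String>` stands in for whatever returns the current markup from the headless browser (with Selenium that would probably be `driver.getPageSource()`), and the `assignment` class name used as the marker is made up for the demo, not taken from the USPTO page.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

class RenderedHtmlWaiter {

    /**
     * Polls htmlSource until the markup it returns contains marker.
     * Returns the rendered HTML, or Optional.empty() on timeout or interruption.
     */
    static Optional<String> waitForMarker(Supplier<String> htmlSource, String marker,
                                          Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String html = htmlSource.get();
            if (html != null && html.contains(marker)) {
                return Optional.of(html);
            }
            // Compare by subtraction so the check stays correct if nanoTime wraps.
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a headless browser: the "AJAX" content appears on the third read.
        int[] reads = {0};
        Supplier<String> fakeBrowser = () ->
                ++reads[0] < 3
                        ? "<div id=\"app\"></div>"
                        : "<div id=\"app\"><div class=\"assignment\">data</div></div>";

        // "assignment" is a hypothetical class name chosen for this demo.
        Optional<String> html = waitForMarker(fakeBrowser, "class=\"assignment\"",
                Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println(html.isPresent() ? "rendered" : "timed out");
    }
}
```

Once the marker shows up, the returned string can be handed to `Jsoup.parse(html)` and queried with the usual selectors. Polling for a specific element is usually more reliable than sleeping for a fixed time, since there is no general way to know when "all" the AJAX has finished.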

N K