
How can I scrape a website that loads its content dynamically, such as a forbes.com article, using Apache HttpClient and without a web driver (which is slow)?

I've tried reading sitemap.xml, but their sitemap includes only the latest articles, and I want information from very old articles.

Also, I want a more generic solution; the web-driver approach (I currently use Selenium with PhantomJS) is site-specific and slow.

Stephan
  • Load the page you want in a desktop browser and look at the network tab of the developer tools to see where the actual content is being loaded from. Very often such dynamic JavaScript sites load their content from some URL, e.g. in JSON format. Then all you need to do is figure out how to load data from the same URL in your own code. – Jonas Czech Jan 06 '16 at 15:21
  • Possible duplicate of [headless internet browser?](http://stackoverflow.com/questions/814757/headless-internet-browser) – Stephan Jan 07 '16 at 10:54
  • @Stephan I don't think it's a duplicate, since I clearly mention that I am looking for a different solution than using a web driver with headless (or not) browser. – theol.zacharopoulos Jan 07 '16 at 12:45
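
The approach from the first comment (call the endpoint the page itself loads its data from) might look roughly like the sketch below. It uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) rather than Apache HttpClient so that it has no external dependencies; the same idea carries over to Apache HttpClient. The class name `JsonEndpointFetcher`, the helper names, the example URL, and the `X-Requested-With` header are all assumptions for illustration; the real endpoint and the headers it expects have to be read off the browser's network tab.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JsonEndpointFetcher {

    // Build a GET request resembling the background call seen in the network tab.
    // The headers are assumptions; copy the ones the real site actually sends.
    static HttpRequest buildRequest(String endpointUrl) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Send the request and return the raw response body (typically JSON).
    static String fetchBody(HttpClient client, String endpointUrl)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(endpointUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No endpoint given: explain usage instead of guessing a URL.
            System.out.println("Usage: java JsonEndpointFetcher.java <endpoint-url>");
            return;
        }
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        System.out.println(fetchBody(client, args[0]));
    }
}
```

The returned body can then be handed to whatever JSON parser the project already uses. This only works when the site exposes such an endpoint without requiring tokens generated by JavaScript, so it is not fully generic.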

1 Answer


I'd suggest trying ui4j. It's a wrapper around the JavaFX WebKit engine with headless modes. It can help you speed things up.

Stephan
  • ui4j has one major drawback: It uses the JavaFX integrated webkit engine and this engine is incompatible with certain web pages, especially when it comes to JavaScript and CSS. – Robert Mar 20 '17 at 15:10