
What's the best way to pass HTML to Java?
Specifically, I need to crawl through 2 TB of HTML files (.warc format, using nutchWAX) and feed them to my Java program one at a time.

Workflow:

  • crawl a page
  • send page to Java program
  • wait for answer and then continue crawling

Question: Should I create a script that escapes all special characters in the HTML and pass the result as an argument, should I write each page to a file and pass the file's path, or is there a better way (bearing in mind that this is 2 TB of data)?

Jaan Susi

2 Answers


I think you should look at the HTML parsers listed on this page:

Comparison of HTML parsers

Creating a script might not be a good idea. The pages may contain inline CSS, JavaScript, and quotes that are already escaped, so getting the escaping right will be a huge pain. I previously tried writing such a script and found it cumbersome. In the end I tried HTML parsers, and they worked like a charm!

swapyonubuntu

You should do it with Jsoup.

http://jsoup.org/

With it, you can easily extract the data you want, such as URLs or links, using a simple API, and feed it into your program. It can also be used in a multithreaded environment and is quite fast.

Also check this answer; it will be very helpful.

For a comparison of Java HTML parsers, go here.

For your question:

Do I create a script to escape all special characters in HTML and then pass it on as an argument.

Jsoup does this for you. If all you want is the text of the HTML document, you might want to use a regex instead, though.

do I write it to a file and pass the path of the file or is there a better way

Yes, you could pass it to your program as a string. Writing 2 TB of files would be very inefficient.
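
One way to pass each page as a string without escaping anything is to pipe it to the Java program's standard input with a length prefix. The sketch below is only an illustration under assumptions: the framing (a 4-byte length followed by the UTF-8 bytes) and the names PageReader, writePage and readPage are made up for this example and are not something nutchWAX provides.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: frames each page as a 4-byte big-endian length
// followed by that many UTF-8 bytes, so no HTML escaping is needed.
final class PageReader {

    // Sender side: write one page to the stream.
    static void writePage(OutputStream out, String html) throws IOException {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(bytes.length);
        data.write(bytes);
        data.flush();
    }

    // Receiver side: return the next page, or null when no further record is available.
    static String readPage(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException endOfStream) {
            return null;
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes); // throws EOFException if the page is truncated
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(System.in));
        String html;
        while ((html = readPage(in)) != null) {
            // Placeholder "answer": print the page size so the sender can continue.
            System.out.println(html.length());
            System.out.flush();
        }
    }
}
```

The crawler side would keep one such process running, write each page to its standard input, and read one line of output before moving on, which matches the crawl, send, wait loop in the question and never touches the disk.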

Note that whatever you do, processing 2,000 GB of HTML is going to take a loooong time!

Hope this helps.

Jonas Czech