What's the best way to pass HTML to Java?
Specifically, I need to crawl through 2 TB of HTML files (stored in .warc format, using NutchWAX) and feed them to my Java program one at a time.
Workflow:
- crawl a page
- send the page to the Java program
- wait for the answer, then continue crawling
Question: Should I write a script that escapes all special characters in the HTML and pass the result as a command-line argument, or should I write each page to a file and pass the file's path? Or is there a better way (bearing in mind that this is 2 TB of data)?
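A third option that might avoid both escaping and temporary files is to start the Java program once and stream each page to it over stdin, reading the answer back from stdout. Below is a minimal sketch of the Java side, not a definitive implementation. It assumes a framing I made up for illustration: each page is sent as a 4-byte big-endian length followed by that many bytes, and each answer is sent back the same way. The class name `PageReceiver`, the method `serve`, and the placeholder handler are hypothetical, and the sketch assumes the pages are UTF-8, which real crawled pages may not be.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class PageReceiver {

    // Reads length-prefixed pages from `in` until end of stream and writes one
    // length-prefixed answer per page to `out`. Returns the number of pages handled.
    // The framing (4-byte length + body) is an assumption of this sketch.
    public static int serve(InputStream in, OutputStream out,
                            Function<String, String> handler) throws IOException {
        DataInputStream dataIn = new DataInputStream(new BufferedInputStream(in));
        DataOutputStream dataOut = new DataOutputStream(new BufferedOutputStream(out));
        int pages = 0;
        while (true) {
            int length;
            try {
                length = dataIn.readInt(); // 4-byte big-endian length prefix
            } catch (EOFException endOfInput) {
                break; // the crawler closed the pipe: no more pages
            }
            if (length < 0) {
                throw new IOException("Negative page length: " + length);
            }
            byte[] body = new byte[length];
            dataIn.readFully(body); // raw bytes, so no escaping is needed
            String html = new String(body, StandardCharsets.UTF_8); // assumes UTF-8

            byte[] answer = handler.apply(html).getBytes(StandardCharsets.UTF_8);
            dataOut.writeInt(answer.length);
            dataOut.write(answer);
            dataOut.flush(); // the crawler is blocked waiting for this answer
            pages++;
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder handler: replies with the page size in characters.
        serve(System.in, System.out, html -> "chars=" + html.length());
    }
}
```

The crawling side would write the length and the page bytes to the process's stdin, then block until it has read the answer from its stdout. With this approach the JVM starts only once for the whole crawl, and the HTML never passes through a shell, so command-line length limits and quoting do not apply.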