I want to parse a large text file in WARC version 0.9 format. A sample of such text is here. If you look at it, you'll see that the whole document is a list of entries of the following form:
[Warc Headers]
[HTTP Headers]
[HTML Content]
I need to extract the URL and the HTML content from each entry. (Note that the sample file consists of multiple page entries, each formatted as shown above.)
I used the following regular expression in Java:
Pattern.compile("warc/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*\n\n.*\n\n(.*)\n\n", Pattern.DOTALL)
Here groups 1 and 2 represent the URL and the HTML content, respectively. There are two problems with this code:
- It is very slow to find a match.
- It only matches the first page.
Java code:
if (mStreamScanner.findWithinHorizon(PAGE_ENTRY, 0) == null) {
    return null;
} else {
    MatchResult result = mStreamScanner.match();
    return new WarcPageEntry(result.group(1), result.group(2));
}
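To make the second symptom reproducible without the full sample file, here is a minimal sketch that runs the same expression through a Scanner over a tiny two-entry input. The input is made up: the header lines, URLs, and HTML bodies are hypothetical and only shaped to satisfy the regular expression, and the class and method names (WarcRegexDemo, entry, sampleInput, findEntries) are mine, not part of the real project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class WarcRegexDemo {
    // Same expression as in the question.
    static final Pattern PAGE_ENTRY = Pattern.compile(
        "warc/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*\n\n.*\n\n(.*)\n\n", Pattern.DOTALL);

    // Hypothetical entry: header fields are invented, only the overall
    // [Warc Headers] / [HTTP Headers] / [HTML Content] layout is kept.
    static String entry(String url, String html) {
        return "warc/0.9 123 response " + url + " 20090101000000 text/html\n"
             + "x-made-up-warc-header: 1\n\n"
             + "HTTP/1.1 200 OK\nContent-Type: text/html\n\n"
             + html + "\n\n";
    }

    // Two made-up page entries, one after the other.
    static String sampleInput() {
        return entry("http://a.example/", "<html>A</html>")
             + entry("http://b.example/", "<html>B</html>");
    }

    // Repeats the findWithinHorizon call until it returns null and
    // collects {group 1, group 2} for every match.
    static List<String[]> findEntries(String input) {
        List<String[]> found = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.findWithinHorizon(PAGE_ENTRY, 0) != null) {
                MatchResult result = scanner.match();
                found.add(new String[] { result.group(1), result.group(2) });
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String[] e : findEntries(sampleInput())) {
            System.out.println("url=" + e[0] + " html=" + e[1]);
        }
    }
}
```

On this input the loop finds a single match even though there are two entries. Group 1 is the first entry's URL, but group 2 is the second entry's HTML, so the one match appears to run from the first header to the end of the input.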
Questions:
- Why is my code only parsing the first page entry?
- Is there a faster way to parse a large text in a streaming manner?