I want to parse a large text file in WARC version 0.9 format. A sample of such text is here. If you take a look at it, you'll see that the whole document consists of a list of entries of the following form:

[Warc Headers]

[HTTP Headers]

[HTML Content]

I need to extract the URL and the HTML content from each entry. (Note that the sample file consists of multiple page entries, each formatted like the layout above.)

I used the following regular expression in Java:

Pattern.compile("warc/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*\n\n.*\n\n(.*)\n\n", Pattern.DOTALL)

Groups 1 and 2 capture the URL and the HTML content, respectively. There are two problems with this code:

  1. It is very slow to find a match.
  2. It only matches the first page.

Java code:

if(mStreamScanner.findWithinHorizon(PAGE_ENTRY, 0) == null){
    return null;
} else {
    MatchResult result = mStreamScanner.match();
    return new WarcPageEntry(result.group(1), result.group(2));
}

Questions:

  • Why is my code only parsing the first page entry?
  • Is there a faster way to parse a large text in a streaming manner?
frogatto

1 Answer

I wouldn't tackle these huge HTML strings with a regex. How about relying on the structure of the document, instead?
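
To see why the single regex from the question is a poor fit, here is a small sketch. The class name, the method name and the two-entry sample input are made up for illustration. With `Pattern.DOTALL`, each greedy `.*` runs as far as it can, so the first match stretches to the last blank line of the input and leaves nothing for a second `find()`. The reluctant variant (`.*?`) is shown only for comparison: it finds both entries here, but it would cut a page short at the first blank line inside the HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyWarcRegexDemo {
    // The pattern from the question: greedy quantifiers plus DOTALL.
    static final Pattern GREEDY = Pattern.compile(
        "warc/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*\n\n.*\n\n(.*)\n\n", Pattern.DOTALL);

    // Same pattern with reluctant quantifiers, for comparison only.
    static final Pattern RELUCTANT = Pattern.compile(
        "warc/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*?\n\n.*?\n\n(.*?)\n\n", Pattern.DOTALL);

    // Made-up input with two entries in the layout described in the question.
    static final String SAMPLE =
          "warc/0.9 100 response http://a.example/ 20090101 text/html\n\n"
        + "HTTP/1.1 200 OK\n\n"
        + "<html>A</html>\n\n"
        + "warc/0.9 100 response http://b.example/ 20090101 text/html\n\n"
        + "HTTP/1.1 200 OK\n\n"
        + "<html>B</html>\n\n";

    // Collects group 1 (the URL) of every match the pattern finds.
    static List<String> matchedUrls(Pattern pattern, String input) {
        List<String> urls = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(matchedUrls(GREEDY, SAMPLE));    // [http://a.example/]
        System.out.println(matchedUrls(RELUCTANT, SAMPLE)); // [http://a.example/, http://b.example/]
    }
}
```

The slowness has a related cause: with a horizon of 0, `Scanner.findWithinHorizon` searches without bound and may buffer all of the input while looking for the pattern. Reading the file line by line and relying on its structure avoids both problems.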

E.g. like so:

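import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// maps each URL to the HTML of its page
HashMap<String, String> output = new HashMap<>();
// matches the first line of an entry and captures the URL in group 1
Pattern pattern = Pattern.compile("^warc\\/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*");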

try (InputStreamReader is = new InputStreamReader(new FileInputStream("excerpt.txt"), "UTF-8")) {               
    try (BufferedReader br = new BufferedReader(is)) {      
        String line;        
        while ((line = br.readLine()) != null) {
            Matcher matcher = pattern.matcher(line);

            if (matcher.matches()) {
                entityLoop: while (true) {
                    String url = matcher.group(1);

                    // skip the WARC and HTTP headers; the HTML starts after the second empty line
                    int countEmptyLines = 0;
                    while ((line = br.readLine()) != null) {
                        if ("".equals(line)) {
                            countEmptyLines++;
                            if (countEmptyLines == 2) break;
                        }
                    }

                    // extract HTML
                    StringBuilder sb = new StringBuilder();
                    while ((line = br.readLine()) != null) {
                        matcher = pattern.matcher(line);
                        if (matcher.matches()) { 
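                            // next entry header found; store the finished entry
                            output.put(url, sb.toString());
                            continue entityLoop; 
                        }
                        sb.append(line).append('\n'); // readLine() strips the line terminator
                    }
                    // end of file reached; store the last entry as well
                    output.put(url, sb.toString());
                    break; // no more url/html-entities available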
                }
            }
        }
    }       
} catch (IOException e) {
    // do something smart
}

// now all your extracted data is stored in "output"

There is still room for improvement in the above code, but it should give you an idea of how to get started.
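
One such improvement, given that the file is large, is to avoid keeping every page in a map and to hand each entry to a callback as soon as it is complete. The following is only a sketch under the same assumption as above (the HTML starts after the second empty line of an entry). The class name `WarcEntryStreamer`, the method `forEachEntry` and the sample input in `main` are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WarcEntryStreamer {
    // First line of an entry; group 1 is the URL.
    private static final Pattern HEADER =
        Pattern.compile("^warc/0\\.9\\s\\d+\\sresponse\\s(\\S+)\\s.*");

    /** Passes each (url, html) pair to the handler and returns the number of entries seen. */
    static int forEachEntry(Reader in, BiConsumer<String, String> handler) {
        int count = 0;
        try (BufferedReader br = new BufferedReader(in)) {
            String url = null;                       // URL of the entry being read
            StringBuilder html = new StringBuilder();
            int emptyLines = 0;                      // empty lines seen in this entry so far
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = HEADER.matcher(line);
                if (m.matches()) {
                    if (url != null) {               // previous entry is complete
                        handler.accept(url, html.toString());
                        count++;
                    }
                    url = m.group(1);
                    html.setLength(0);
                    emptyLines = 0;
                } else if (url != null) {
                    if (emptyLines < 2) {            // still inside the WARC/HTTP headers
                        if (line.isEmpty()) emptyLines++;
                    } else {
                        html.append(line).append('\n');
                    }
                }
            }
            if (url != null) {                       // last entry ends at end of input
                handler.accept(url, html.toString());
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample =
              "warc/0.9 100 response http://a.example/ 20090101 text/html\n\n"
            + "HTTP/1.1 200 OK\n\n<html>A</html>\n\n"
            + "warc/0.9 100 response http://b.example/ 20090101 text/html\n\n"
            + "HTTP/1.1 200 OK\n\n<html>B</html>\n\n";
        forEachEntry(new StringReader(sample),
            (url, html) -> System.out.println(url + " -> " + html.trim()));
    }
}
```

Only one page is held in memory at a time, so memory use depends on the largest page rather than on the size of the whole file. For a real file, the `StringReader` would be replaced by an `InputStreamReader` over the file, as in the code above.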

morido