
I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.

What would be the fastest way to achieve such a huge file traversal and String detection? Is there a more efficient approach than the following implementation using a Scanner and String.startsWith()?

```java
public static int countOccurences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}
```

Note:

  • So far it looks like the Scanner is the bottleneck (i.e., if I add more complex processing than token detection and apply it to all lines, the overall execution time stays more or less the same).
  • I'm using SSDs so there is no room for improvement on the hardware side

Thanks in advance for your help.

nevets1219
Kraal
  • Come on! It's NOT a duplicate. Have you read the question you're referring to? I'm talking about files that have millions of lines, not 20k, and I'm also talking about token detection. Your "duplicate" tag only shows that you read neither my question nor the question you're referring to. – Kraal Mar 22 '17 at 18:20
  • I can't post it as an answer now that the question is closed, but take a look at how `grep` [does that](https://stackoverflow.com/questions/12629749/how-does-grep-run-so-fast). It avoids reading every byte using the [Boyer–Moore search algorithm](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm). – Malt Mar 22 '17 at 18:24
  • I also don't think this is a duplicate. How does this get unmarked? – mikea Mar 22 '17 at 18:26
  • Thanks for your answer Malt, `Scanner` is also using this algorithm in its `findWithinHorizon` method. However, I have issues using it, as the horizon may end in the middle of a token (in such cases I miss some tokens). – Kraal Mar 22 '17 at 18:30
  • I reopened this. I think the least effort approach would be to find a 3rd party lib instead of expecting miracles from `Scanner`. – Kayaman Mar 22 '17 at 19:04
  • Read this: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly . If you want, you can read big chunks with `ByteBuffer.allocateDirect`, which will save time on reads from the file, and you can parallelize processing of the allocated data to search for your token and line breaks. – Nimrod007 Mar 22 '17 at 19:07
  • `grep | wc -l`? –  Mar 22 '17 at 19:36
  • Thank you @Nimrod007 for this article. – Kraal Mar 24 '17 at 16:21

2 Answers


A few pointers (assuming the lines are relatively short and the data is really ASCII or similar):

  • read a huge buffer of bytes at a time (say 1/4 GB), then chop off the incomplete line at the end to prepend to the next read.

  • search for bytes, do not waste time converting to chars

  • indicate "beginning of line" by starting your search pattern with '\n'; treat the first line specially

  • use high-speed search that reduces search time at the expense of pre-processing (google for "fast substring search")

  • if actual line numbers (rather than the lines) are needed, count the lines in a separate stage
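
A minimal sketch along these lines (the class name and the method name are mine; a plain byte-at-a-time state machine stands in for the preprocessed fast substring search, and ASCII data with '\n' line endings is assumed):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TokenLineCounter {
    // Count lines that start with `token`, scanning raw bytes in large chunks.
    // A simple state machine stands in for a preprocessed substring search;
    // because the match state survives across buffer boundaries, tokens split
    // between two reads are still detected.
    public static long countLinesStartingWith(Path file, byte[] token, int bufferSize)
            throws IOException {
        byte[] buf = new byte[bufferSize];
        long count = 0;
        boolean atLineStart = true;  // true at file start and after each '\n'
        int matched = 0;             // token bytes matched so far on this line
        try (InputStream in = Files.newInputStream(file)) {
            for (int n; (n = in.read(buf)) > 0; ) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {             // new line: restart the match
                        atLineStart = true;
                        matched = 0;
                    } else if (atLineStart) {
                        if (b == token[matched]) {
                            if (++matched == token.length) {
                                count++;         // full prefix matched
                                atLineStart = false;
                            }
                        } else {
                            atLineStart = false; // prefix broken; skip to next '\n'
                        }
                    }
                }
            }
        }
        return count;
    }
}
```

Because the state machine carries `matched` across reads, this variant does not need to prepend the incomplete tail of one buffer to the next; a real Boyer–Moore style search would need that carry-over.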

  • This is what I ended up doing. For information, total execution time dropped from 16 minutes to 1 minute when using 100 MB buffers and the Boyer–Moore search algorithm. For now it's an approximation, as I still have to detect tokens that have been split in two; however, even then, in the worst case I miss fewer than 0.025% of the occurrences I'm looking for... which is precise enough for now. Thanks! – Kraal Mar 24 '17 at 16:37

We can reduce the problem to searching for `\n<token>` in a byte stream. In that case, one quick way is to read a chunk of data sequentially from disk (the size is determined empirically, but a good starting point is 1024 pages) and hand that data to a different thread for processing.
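
A sketch of that hand-off, assuming '\n' line endings (the class name, queue capacity, and the naive `occurrences` scan are mine; a Boyer–Moore search could replace the inner loop). The reader thread fills a bounded queue while the consumer searches each chunk, carrying the last `needle.length - 1` bytes forward so a token split across chunks is still found:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkedCounter {
    private static final byte[] EOF = new byte[0];  // poison pill for the queue

    public static long count(Path file, byte[] token, int chunkSize) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);

        // Producer: sequential reads from disk, minimal error handling for brevity.
        Thread reader = new Thread(() -> {
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[chunkSize];
                for (int n; (n = in.read(buf)) > 0; ) {
                    queue.put(Arrays.copyOf(buf, n));
                }
                queue.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        // Consumer: search for '\n' + token; seed with a leading '\n' so a
        // match on the very first line is also found.
        byte[] needle = new byte[token.length + 1];
        needle[0] = '\n';
        System.arraycopy(token, 0, needle, 1, token.length);

        byte[] carry = {'\n'};
        long total = 0;
        for (byte[] chunk; (chunk = queue.take()) != EOF; ) {
            byte[] data = new byte[carry.length + chunk.length];
            System.arraycopy(carry, 0, data, 0, carry.length);
            System.arraycopy(chunk, 0, data, carry.length, chunk.length);
            total += occurrences(data, needle);
            // Keep the last needle-1 bytes: too short to re-match on its own,
            // but enough to complete a token split across the boundary.
            int keep = Math.min(needle.length - 1, data.length);
            carry = Arrays.copyOfRange(data, data.length - keep, data.length);
        }
        reader.join();
        return total;
    }

    // Naive byte search; swap in Boyer–Moore for long tokens.
    private static long occurrences(byte[] data, byte[] needle) {
        long count = 0;
        outer:
        for (int i = 0; i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) continue outer;
            }
            count++;
        }
        return count;
    }
}
```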

Tyzoid