I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.
What would be the fastest way to traverse such a huge file and detect the matching lines? Is there a more efficient approach than the following implementation, which uses a Scanner and String.startsWith()?
public static int countOccurences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}
Note:
- So far the Scanner appears to be the bottleneck: if I add more complex per-line processing on top of the token detection and apply it to all lines, the overall execution time stays more or less the same.
- I'm reading from SSDs, so there is little room for improvement on the hardware side.
Thanks in advance for your help.