I am trying to search a large text file (400MB) for a particular String using the following:

File file = new File("fileName.txt");
try {
    int count = 0;
    Scanner scanner = new Scanner(file);
    while(scanner.hasNextLine()) {
        if(scanner.nextLine().contains("particularString")) {
            count++;
            System.out.println("Number of instances of String: " + count);
        }
    }
} catch (FileNotFoundException e){
    System.out.println(e);
}

This works fine for small files; however, for this particular file and other large ones it takes far too long (>10 mins).

What would be the quickest, most efficient way of doing this?

I have now changed to the following and it completes within seconds:

try {
    int count = 0;
    FileReader fileIn = new FileReader(file);
    BufferedReader reader = new BufferedReader(fileIn);
    String line;
    while((line = reader.readLine()) != null) {
        if(line.contains("particularString")) {
            count++;
            System.out.println("Number of instances of String: " + count);
        }
    }
} catch (IOException e){
    System.out.println(e);
}
Chief DMG
  • Compare the speed to `grep -c particularString fileName.txt`. – Andy Turner Apr 28 '16 at 14:13
  • Isn't it going to be faster if he first reads the entire file to memory? – Evdzhan Mustafa Apr 28 '16 at 14:13
  • One very trivial thing that's unrelated to your file accessing methodology is the `System.out.println` invocation: if you have a large number of matches, it *will* actually slow down your execution, as you're building and printing a new `String` every time. Of course, this is not the real optimization you're looking for here. – Mena Apr 28 '16 at 14:15
  • Possible duplicate of [what's the fastest way to scan a very large file in java?](http://stackoverflow.com/questions/4886154/whats-the-fastest-way-to-scan-a-very-large-file-in-java) – Adnan Isajbegovic Apr 28 '16 at 14:16
  • What's the time without printing? If still bad performance, you could try `BufferedReader().lines().forEach(..)` to parallelize the read. maybe. – membersound Apr 28 '16 at 14:18
  • @membersound Parallelize the read? Wouldn't you be restricted by disk IO? – Clark Kent Apr 28 '16 at 14:19
  • Does the file actually have lines? – Roman K Apr 28 '16 at 14:25
  • Most obvious speed up, DO NOT use Scanner, very slow compared to BufferedReader. Scanner does A LOT of parsing. – jn1kk Apr 28 '16 at 14:31

3 Answers


First, figure out how long it takes to actually read the entire file's contents vs. how long it takes to scan them for your pattern.

If your results are dominated by the read time (and assuming you read it properly, i.e. with channels or at the very least buffered readers), there's not much you can do.

If it's the scanning time that dominates, you could read the lines and ship small batches of them to a work queue, where multiple threads pick up line batches and search them.

Ballpark figures:

  • Assuming 50 MB/sec as the hard drive read speed (and that's slow by modern standards), you should be able to read the entire file into memory in <10 seconds.
  • Looking at MD5 hashing speed benchmarks (example here) shows that the hashing rate can be at least as fast as (often faster than) disk read speed. Also, string searching is faster, simpler, and parallelizes better than hashing.

Given those two estimates, I think a proper implementation can easily land you a runtime on the order of 10 seconds (if you start kicking off search jobs as you read line batches), largely dominated by your disk read time.
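As an illustration of the batching idea described above, here is a rough sketch using an `ExecutorService`; the batch size, class name, and search string are illustrative choices, not part of the original answer:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class ParallelSearch {
    static final int BATCH_SIZE = 10_000; // illustrative batch size

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        LongAdder count = new LongAdder(); // thread-safe counter

        // Single reader thread fills batches and hands them to the pool
        try (BufferedReader reader = new BufferedReader(new FileReader("fileName.txt"))) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    submit(pool, batch, count);
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            submit(pool, batch, count); // final partial batch
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("matches: " + count.sum());
    }

    static void submit(ExecutorService pool, List<String> batch, LongAdder count) {
        pool.submit(() -> {
            for (String s : batch) {
                if (s.contains("particularString")) count.increment();
            }
        });
    }
}
```

Note this counts matching lines, like the original code; whether the extra threads help in practice depends on whether scanning or disk IO dominates, as the answer says.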

radai
  • Good answer, I think a lot of people would be more inclined to just implement the batching and expect it to be super fast, but in reality the slowness could be from something else. – Brandon Ling Apr 28 '16 at 14:38
  • Thanks. I have changed to a buffered reader and it has done the job, takes a few secs now. – Chief DMG Apr 28 '16 at 15:33

Scanner is simply not useful in this case. Under the hood, it does all kinds of input parsing, checking, caching and whatnot. If your case is simply "iterate over all lines of a file", use something that is based on a simple BufferedReader.

In your particular case, I recommend using `Files.lines`.

Example:

  // try-with-resources closes the underlying file handle held by the stream
  try (Stream<String> lines = Files.lines(Paths.get("testfile.txt"))) {
      long count = lines.filter(s -> s.contains("particularString"))
                        .count();
      System.out.println(count);
  }

(Note that this particular use of the streaming API may not cover what you are actually trying to achieve; unfortunately your question does not indicate what the result of the method should be.)

On my system, I get about 15% of Scanner runtime with Files.lines() or a buffered reader.

mtj

Use Scanner's `findWithinHorizon` method. Scanner will internally use a FileChannel to read the file, and for a literal pattern the regex engine ends up using a Boyer-Moore style algorithm for efficient string searching.
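A rough sketch of that approach, assuming the needle is a literal string (hence `Pattern.quote`); the file and class names are illustrative:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class HorizonSearch {
    public static void main(String[] args) throws FileNotFoundException {
        int count = 0;
        // Pattern.quote treats the needle as a literal string, not a regex
        Pattern needle = Pattern.compile(Pattern.quote("particularString"));
        try (Scanner scanner = new Scanner(new File("fileName.txt"))) {
            // horizon 0 means the search is not bounded;
            // each successful call advances past the match
            while (scanner.findWithinHorizon(needle, 0) != null) {
                count++;
            }
        }
        System.out.println("Number of instances of String: " + count);
    }
}
```

Unlike the line-based versions, this counts every occurrence, including multiple matches on one line.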