
Problem: I have an array of some 700 strings, which I'm reading into a List. I then have a directory containing over 1500 files. I need to open every one of these files and see if any of the 700 strings appear anywhere within each of them.

Current solution: After reading in the 700 strings (which is pretty much instantaneous), this is what I'm doing:

public static void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    for (final File fileEntry : myDirectory.listFiles()) {
        System.out.println("Entering file: " + currentCount++);
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {

            BufferedReader br = null;
            try {
                String sCurrentLine;
                br = new BufferedReader(new FileReader(fileEntry.getPath()));

                while ((sCurrentLine = br.readLine()) != null) {
                    for (int i = 0; i < listOfStrings.size(); i++) {
                        if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(sCurrentLine, listOfStrings.get(i))) {
                            matchLocations.put(listOfStrings.get(i), fileEntry.getPath());
                        }
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (br != null) {
                        br.close();
                    }
                } catch (IOException ex) {
                    ex.printStackTrace();
                }
            }
        }
    }
}
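
For reference, the snippet uses two members declared elsewhere in the class. Presumably they look roughly like this (a guess, not shown in the original post):

private static final Map<String, String> matchLocations = new HashMap<>();
private static int currentCount = 0;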

After calling this procedure, I have all the results stored in a HashMap and I can output the results to screen or file.

Question: What is a faster way to do this? It seems extremely slow (taking around 20-25 minutes to run through ~1500 files). I'm not very familiar with threading, but I've considered using it. However, the top answer in this question has put me off a little. What is the best way to speed up performance?

  • According to the answer you linked, multithreading this is not going to be a good idea. Are you using NIO, as that answer suggests? – Azar Nov 03 '14 at 11:48
  • No. It's another thing I'm considering. I was hoping to gauge as many responses as possible before delving down a particular route. – Andrew Martin Nov 03 '14 at 11:50
  • The answer you linked is correct. Unless you're reading from 15 different SSDs, the file reading will be the bottleneck. – Michael Nov 03 '14 at 11:50
  • In the answer I linked he provided a link to a page with three possible solutions: Read a small file in buffer of file size, Read a large file in chunks with fixed size buffer or Faster file copy with mapped byte buffer. Does anyone have any idea which is better/preferred? Or are they all much of a muchness? – Andrew Martin Nov 03 '14 at 11:52
  • You will have to replace your code with multithreading in Java. It seems like a pooling scenario you are encountering here. Or better, use Spring Batch, which is provided for these kinds of requirements. – vikeng21 Nov 03 '14 at 11:55
  • Have you considered combining your 700 strings into one regular expression (like `(\Qaaa\E|\Qbbb\E|...)`, sketched below the comments) to speed up finding strings in lines? (If your strings may be prefixes of other strings you will still have to loop, but pattern matching may serve as a shortcut.) – halfbit Nov 03 '14 at 12:16
  • I hadn't actually, no. Would the performance difference be noticeable? Or is the bottleneck coming from actually opening the files, as opposed to scanning their contents? – Andrew Martin Nov 03 '14 at 12:17
  • What about full text search like Lucene/Solr? You would only have to index your files once, then you could search as many strings you like. – Stefan Nov 03 '14 at 12:19
  • @AndrewMartin It is hard to tell whether combining into one regex would make a difference. It depends how fast reading from files actually is in your case. Have you checked CPU load during a run on your system? If CPU load is above say 80% you might want to give it a try. – halfbit Nov 03 '14 at 12:24
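
For reference, a minimal sketch of the single-regex idea halfbit describes above, assuming the 700 strings are plain literals (the method name is illustrative, not from the original post):

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Combine every search string into one case-insensitive alternation.
// Pattern.quote() wraps each literal in \Q...\E so regex metacharacters are not interpreted.
static Pattern buildSearchPattern(final List<String> listOfStrings) {
    String alternation = listOfStrings.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
    return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
}

A single buildSearchPattern(listOfStrings).matcher(line).find() then answers "does any of the strings occur in this line?", but as halfbit notes, a per-string check is still needed to know which one matched.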

1 Answer


I prefer NIO for reading lines:

private final Map<String, String> matchLocations = new HashMap<>();
private int currentCount = 0;

public void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    File[] files = myDirectory.listFiles();
    if (files == null) {
        return;
    }
    Stream.of(files).forEach(fileEntry -> {
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {
            System.out.println("Entering file: " + currentCount++);
            try {
                List<String> lines = Files.readAllLines(Paths.get(fileEntry.getAbsolutePath()), StandardCharsets.UTF_8);
                StringBuilder sb = new StringBuilder();
                lines.forEach(s -> sb.append(s.toLowerCase()).append("\n"));
                listOfStrings.forEach(s -> {
                    if (sb.indexOf(s.toLowerCase()) >= 0) { // >= 0, otherwise a match at index 0 is missed
                        matchLocations.put(s, fileEntry.getPath());
                    }
                });
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    });
}
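
One small tweak on top of this, assuming the search strings don't change while scanning: lowercase the 700 strings once up front instead of once per file, e.g.

List<String> lowerCaseStrings = listOfStrings.stream()
        .map(String::toLowerCase)
        .collect(Collectors.toList());

and pass lowerCaseStrings (an illustrative name) into scanMyDirectory, so only the file contents are lowercased in the inner loop.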

As mentioned above, there's no need for multithreading... but if you're interested:

private final ConcurrentHashMap<String, String> matchLocations = new ConcurrentHashMap<>();
private final ForkJoinPool pool = new ForkJoinPool();
private int currentCount = 0;

public void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    File[] files = myDirectory.listFiles();
    if (files == null) {
        return;
    }
    Stream.of(files).forEach(fileEntry -> {
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {
            System.out.println("Entering file: " + currentCount++);
            pool.submit(new Reader(listOfStrings, fileEntry));
        }
    });
}

class Reader implements Runnable {

    final List<String> listOfStrings;
    final File file;

    Reader(List<String> listOfStrings, File file) {
        this.listOfStrings = listOfStrings;
        this.file = file;
    }

    @Override
    public void run() {
        try {
            List<String> lines = Files.readAllLines(Paths.get(file.getAbsolutePath()), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            lines.forEach(s -> sb.append(s.toLowerCase()).append("\n"));
            listOfStrings.forEach(s -> {
                if (sb.indexOf(s.toLowerCase()) >= 0) { // >= 0, otherwise a match at index 0 is missed
                    matchLocations.put(s, file.getPath());
                }
            });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
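
Note that scanMyDirectory only submits the tasks; nothing waits for them to finish. Before reading matchLocations, the caller should let the pool drain, for example (rootDirectory is a placeholder for whatever directory the scan starts from):

scanMyDirectory(rootDirectory, listOfStrings);
// Block until all submitted Reader tasks have completed (or the timeout expires)
// before inspecting matchLocations.
pool.awaitQuiescence(30, TimeUnit.MINUTES);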

EDIT

bug fixes:

private final ConcurrentHashMap<String, List<String>> matchLocations = new ConcurrentHashMap<>();
private final ForkJoinPool pool = new ForkJoinPool();
private int currentCount = 0;

public void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    File[] files = myDirectory.listFiles();
    if (files == null) {
        return;
    }
    Stream.of(files).forEach(fileEntry -> {
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {
            System.out.println("Entering file: " + currentCount++);
            Reader reader = new Reader(listOfStrings, fileEntry);
            pool.submit(reader);
        }
    });
}

class Reader implements Runnable {

    final List<String> listOfStrings;
    final File file;

    Reader(List<String> listOfStrings, File file) {
        this.listOfStrings = listOfStrings;
        this.file = file;
    }

    @Override
    public void run() {
        try (FileInputStream fileInputStream = new FileInputStream(file);
             FileChannel channel = fileInputStream.getChannel()) {
            StringBuilder sb = new StringBuilder();
            ByteBuffer buffer = ByteBuffer.allocate(512);
            int read;
            while (true) {
                read = channel.read(buffer);
                if (read == -1) {
                    break;
                }
                buffer.flip();
                sb.append(new String(buffer.array(), 0, read).toLowerCase()); // decode only the bytes actually read, not the whole 512-byte array
                buffer.clear();
            }
            listOfStrings.stream()
                    .map(String::toLowerCase)
                    .forEach(s -> {
                        if (sb.indexOf(s) >= 0) { // >= 0, otherwise a match at index 0 is missed
                            // computeIfAbsent + synchronizedList: several Reader tasks may match the same string concurrently
                            matchLocations
                                    .computeIfAbsent(s, key -> Collections.synchronizedList(new ArrayList<>()))
                                    .add(file.getAbsolutePath());
                        }
                    });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
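
Once the pool has quiesced, the results can be printed straight from the map, e.g.

matchLocations.forEach((searchString, paths) ->
        System.out.println(searchString + " found in: " + paths));
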
FaNaJ
  • I like this solution, but I've a problem. On some valid files, it's skipping over them and giving a java.nio.charset.MalformedInputException: Input length = 1 error instead. When I open them, they look fine to me. What could the issue be? – Andrew Martin Nov 03 '14 at 13:01
  • @AndrewMartin probably it's due to the encoding of the files...I've edited my post – FaNaJ Nov 03 '14 at 13:15
  • I think it is. I trialled it with lower case and changed my list of strings to lower case and it matches my original results. Thanks very much for this. – Andrew Martin Nov 03 '14 at 13:24
  • Do you know which operations are most time consuming? – BlueLettuce16 Nov 03 '14 at 15:36