2

I have a regex pattern of words like welcome1|welcome2|changeme... which I need to search for in thousands of files (anywhere between 100 and 8000), each ranging from 1 KB to 24 MB in size.

I would like to know if there's a faster way of pattern matching than what I have been doing.

Environment:

  1. jdk 1.8
  2. Windows 10
  3. Unix4j Library

Here's what I've tried so far:

try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                                    .filter(FilePredicates.isFileAndNotDirectory())) {

        List<String> obviousStringsList = Strings_PASSWORDS.stream()
                                                .map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this

        Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));

        GrepOptions options = new GrepOptions.Default(GrepOption.count,
                                                        GrepOption.ignoreCase,
                                                        GrepOption.lineNumber,
                                                        GrepOption.matchingFiles);
        Instant startTime = Instant.now();

        final List<Path> filesWithObviousStringss = stream
                .filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
                .collect(Collectors.toList());

        System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}

I get Time taken = 60 seconds, which makes me think I'm doing something really wrong.

I've tried different ways with the stream, and on average every method takes about a minute to process my current folder of 6660 files.

Grep on msys2/mingw64 takes about 15 seconds and exec('grep...') in Node.js takes about 12 seconds, consistently.

I chose Unix4j because it provides a grep implementation in pure Java and clean code.

Is there a way to produce better results in Java that I'm sadly missing?

Gaurav Goenka
  • If I were you, I would create some FileNameFilters and fetch only the partially matching subset of files. The streaming approach you did is correct I think; another thing you may do is to run your search in multiple threads, but that probably will not help because of the IO operations you have to make meanwhile. Please have a look at [this](https://stackoverflow.com/questions/31706058/get-large-directory-content-faster-java-io-file-alternatives) too as an example of using filtering while streaming folder contents. – m4gic Aug 28 '18 at 12:41
  • @m4gic, I don't need FileNameFilters. I'm searching for text inside the file, not searching based on file names. – Gaurav Goenka Aug 28 '18 at 13:31
  • 1
    I’m a bit surprised about the changed options. Only `GrepOption.matchingFiles` looks like what you want here, as you’re only interested in the files (which allows `Unix4j.grep` to stop on the first match within a file, just like the other solutions), but this renders `GrepOption.count` and `GrepOption.lineNumber` obsolete (or contradicting). Further, as far as I understood, when you provide a precompiled `Pattern` object, the `GrepOption.ignoreCase` can’t have an effect. You have to provide `Pattern.CASE_INSENSITIVE` to `Pattern.compile` if you want case insensitive matching (see the sketch after these comments). – Holger Aug 29 '18 at 07:43
  • I did run into this issue yesterday and I realised `GrepOption.count`, `GrepOption.lineNumber` and `GrepOption.matchingFiles` are contradictory, or should I say mutually exclusive to each other. I made the exact changes you suggested. Maybe you should highlight that point somewhere so it's more visible. – Gaurav Goenka Aug 29 '18 at 12:35
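
In code, the change discussed in the two comments above looks roughly like this (a sketch reusing `obviousStringsList` from the question, and assuming `GrepOptions.Default` accepts a single option the same way it accepts four):

// Case-insensitivity must be baked into the precompiled Pattern;
// GrepOption.ignoreCase cannot affect an already-compiled Pattern.
Pattern pattern = Pattern.compile(
        String.join("|", obviousStringsList), Pattern.CASE_INSENSITIVE);
// matchingFiles alone suffices when only the list of files is needed
GrepOptions options = new GrepOptions.Default(GrepOption.matchingFiles);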

4 Answers

2

I don’t know what “Unix4j” provides that isn’t already in the JDK, as the following code does everything with built-in features:

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
        Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
        long startTime = System.nanoTime();
        final List<Path> filesWithObviousStringss = stream
                .filter(path -> {
                    try(Scanner s = new Scanner(path)) {
                        return s.findWithinHorizon(pattern, 0) != null;
                    } catch(IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        System.out.println("Time taken = "
            + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

One important property of this solution is that it doesn’t read the whole file, but stops at the first encountered match. Also, it doesn’t deal with line boundaries, which is suitable for the words you’re looking for, as they never contain line breaks anyway.
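
One caveat worth noting: since the words are joined with "|" into a single regex, any regex metacharacter inside a word would silently change the pattern's meaning. A sketch of a safer join (not part of the original code; it also restores the case-insensitive matching the question asked for):

// Quote each word so regex metacharacters are matched literally,
// and bake case-insensitivity into the pattern itself.
Pattern pattern = Pattern.compile(
        Strings_PASSWORDS.stream()
                         .map(Pattern::quote)
                         .collect(Collectors.joining("|")),
        Pattern.CASE_INSENSITIVE);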

After analyzing the findWithinHorizon operation, I suspect that line-by-line processing may be better for larger files, so you may try

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
        Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
        long startTime = System.nanoTime();
        final List<Path> filesWithObviousStringss = stream
                .filter(path -> {
                    try(Stream<String> s = Files.lines(path)) {
                        return s.anyMatch(pattern.asPredicate());
                    } catch(IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        System.out.println("Time taken = "
            + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

instead.

You may also try turning the stream to parallel mode, e.g.

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
        Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
        long startTime = System.nanoTime();
        final List<Path> filesWithObviousStringss = stream
                .parallel()
                .filter(path -> {
                    try(Stream<String> s = Files.lines(path)) {
                        return s.anyMatch(pattern.asPredicate());
                    } catch(IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        System.out.println("Time taken = "
            + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

It’s hard to predict whether this has a benefit, as in most cases, the I/O dominates such an operation.
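
If you do experiment with parallelism, you can at least control how many threads hit the disk by running the pipeline inside a dedicated ForkJoinPool. That parallel streams execute in the pool that submits them is well-known but undocumented behavior, so treat the following as a sketch for experimentation only, reusing the line-based filter from above:

// Sketch: a dedicated pool caps the parallelism of this I/O-bound job.
ForkJoinPool ioPool = new ForkJoinPool(4); // tune to your storage, not your CPU count
try {
    Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
    List<Path> filesWithObviousStringss = ioPool.submit(() -> {
        try (Stream<Path> s = Files.walk(Paths.get(FILES_DIRECTORY))
                                   .filter(Files::isRegularFile)) {
            return s.parallel()
                    .filter(path -> {
                        try (Stream<String> lines = Files.lines(path)) {
                            return lines.anyMatch(pattern.asPredicate());
                        } catch (IOException ex) {
                            throw new UncheckedIOException(ex);
                        }
                    })
                    .collect(Collectors.toList());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }).join();
} finally {
    ioPool.shutdown();
}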

Holger
  • every time I measure some of your code against mine, I lose. I am not even going to bother this time... 1+ – Eugene Aug 28 '18 at 13:00
  • This would also stop on the first match of the pattern. If two words of my pattern are in the same file, it will stop looking for the second one, as you said, which is a problem. – Gaurav Goenka Aug 28 '18 at 13:30
  • 2
    @GauravGoenka why is this a problem? The result you get is a `List` that doesn’t tell you anything about which words matched or how often. That’s no different from your original code. You asked how to make that code faster, not how to get more information. – Holger Aug 28 '18 at 13:56
  • @Holger, I get your point. My bad. However, I tested it three times and it takes more than 125 seconds each time as compared to ~60 seconds of my code. – Gaurav Goenka Aug 28 '18 at 14:33
  • 1
    It’s hard to analyze without an actual test case. Can you provide a link to the Unix4j library you are using? The libraries I find via Google do not match your description. – Holger Aug 28 '18 at 14:59
  • 1
    I added alternatives, but assuming you used http://www.unix4j.org, I think, there’s no actual fundamental difference in the operations that would make a huge difference in performance. The difference to native tools may stem from processing the data as 8 bit character data instead of Unicode. You would have to rewrite the code fundamentally to get a similar result. – Holger Aug 28 '18 at 15:48
  • I completely forgot about `parallelStream()`, but adding that saves me almost 50% time. I'm down from 60 seconds to 40 seconds. How silly of me to have forgotten that! – Gaurav Goenka Aug 28 '18 at 15:55
  • 1
    Right, that’s the same library as I assumed. As said, there is no fundamental difference in the operations so far. I added a new answer which addresses the specific differences to the native tools, with an alternative solution that may close the gap. – Holger Aug 28 '18 at 16:34
2

The main reason native tools can process such text files much faster is their assumption of one particular charset, especially when it has an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction is capable of supporting arbitrary charsets.

When we similarly assume a single charset with the properties named above, we can use low-level tools which may increase the performance dramatically.

For such an operation, we define the following helper methods:

private static char[] getTable(Charset cs) {
    if(cs.newEncoder().maxBytesPerChar() != 1f)
        throw new UnsupportedOperationException("Not an 8 bit charset");
    // decode all 256 possible byte values in one pass to build the lookup table
    byte[] raw = new byte[256];
    IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
    char[] table = new char[256];
    cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
      .decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
    // the lower 128 characters must map to themselves, i.e. be ASCII compatible
    for(int i = 0; i < 128; i++)
        if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
    return table;
}

and

private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
    try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
        long actualSize = fch.size();
        int size = (int)actualSize;
        if(size != actualSize) throw new UnsupportedOperationException("file too large");
        MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
        final class MappedCharSequence implements CharSequence {
            final int start, size;
            MappedCharSequence(int start, int size) {
                this.start = start;
                this.size = size;
            }
            public int length() {
                return size;
            }
            public char charAt(int index) {
                if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
                byte b = mbb.get(start + index);
                return b<0? table[b+256]: (char)b;
            }
            public CharSequence subSequence(int start, int end) {
                int newSize = end - start;
                if(start < 0 || end < start || end > size) // end may not exceed this sequence's size
                    throw new IndexOutOfBoundsException();
                return new MappedCharSequence(start + this.start, newSize);
            }
            public String toString() {
                return new StringBuilder(size).append(this).toString();
            }
        }
        return new MappedCharSequence(0, size);
    }
}

This allows us to map a file into virtual memory and project it directly onto a CharSequence, without copy operations, assuming that the mapping can be done with a simple table and that, for ASCII-based charsets, the majority of the characters do not even need a table lookup, as their numerical value is identical to the Unicode codepoint.

With these methods, you may implement the operation as

// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
    Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
    long startTime = System.nanoTime();
    final List<Path> filesWithObviousStringss = stream//.parallel()
            .filter(path -> {
                try {
                    return pattern.matcher(mapAsciiBasedText(path, table)).find();
                } catch(IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            })
            .collect(Collectors.toList());
    System.out.println("Time taken = "
        + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

This runs much faster than the normal text conversion, but still supports parallel execution.

Besides requiring an ASCII based single byte encoding, there’s the restriction that this code doesn’t support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn’t add this complication unless really needed.
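
Should that need ever arise, here is a rough sketch (mine alone, trading away the zero-copy property for simplicity): process the file in overlapping windows, where the overlap is at least as long as the longest possible match, so no hit spanning a window boundary is lost.

// Sketch only: scan files beyond the 2 GiB limit in overlapping windows.
// OVERLAP is an assumed upper bound for the longest possible match (1 KiB here).
private static boolean containsMatch(Path p, Pattern pattern, char[] table)
        throws IOException {
    final int WINDOW  = 1 << 24; // 16 MiB per mapped window
    final int OVERLAP = 1 << 10; // must be >= the longest possible match
    try (FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
        long fileSize = fch.size();
        char[] chars = new char[WINDOW];
        long pos = 0;
        do {
            int len = (int) Math.min(WINDOW, fileSize - pos);
            MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, pos, len);
            // unlike the zero-copy CharSequence above, this sketch simply
            // copies each window through the 8 bit table
            for (int i = 0; i < len; i++) {
                byte b = mbb.get(i);
                chars[i] = b < 0 ? table[b + 256] : (char) b;
            }
            if (pattern.matcher(CharBuffer.wrap(chars, 0, len)).find()) return true;
            pos += WINDOW - OVERLAP; // step back by OVERLAP so boundary hits are seen
        } while (pos < fileSize);
    }
    return false;
}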

Holger
  • 1
    very first time I see this low level API :| – Eugene Aug 29 '18 at 08:43
  • 1
    @Eugene which is a good sign, as it implies that the easy-to-use high level APIs had a reasonable performance for your tasks so far (hence, did not require you to dig that deep)… – Holger Aug 29 '18 at 09:39
  • @Eugene +1 there. Holger, will try this one now. – Gaurav Goenka Aug 29 '18 at 12:36
  • @Holger, this throws the `Not an 8 bit charset` exception. I'm parsing `html,js` files. Does that cause this problem? – Gaurav Goenka Aug 29 '18 at 13:36
  • 1
    Since you specified `Windows 10`, I assumed the default charset to be some Windows codepage like `CP1252`, which is typical for Western Windows installation. But as stated in a comment in my code, some environments like when running a program within an IDE may use a different default. Especially Netbeans is known for using `UTF-8` instead of the system encoding. You may do either, alter the configuration to run with the true default charset or replace `Charset.defaultCharset()` with, e.g. `Charset.forName("CP1252")` (use `[System.Text.Encoding]::Default` in PowerShell to get the right number) – Holger Aug 29 '18 at 13:51
  • Wow, this low-level code went absolutely crazy! `~28 seconds`. And I'm using `IntelliJ Idea`, which seems to be using `UTF-8`, although my Windows machine is using `iso-8859-1`. Changing `Charset.defaultCharset()` to `Charset.forName("UTF-8")` works. – Gaurav Goenka Aug 30 '18 at 07:20
  • @Holger, how do I make it return more info? Like the word that was found from the pattern, and possibly the line number? To actually represent a `grep`? I'm not following the program completely yet, hence working out these details is tricky for me! – Gaurav Goenka Aug 30 '18 at 08:00
  • 1
    Did you use parallel or sequential? I wouldn’t expect `Charset.forName("UTF-8")` to work, as it is not a single-byte charset. But `Charset.forName("ISO-8859-1")` should work, but I suspect it is actually `Charset.forName("CP1252")`, which is based on iso-8859-1, but extends it. If the pattern only matches ascii characters, you could also use `Charset.forName("US-ASCII")`, which ignores all non-ascii input characters which would even work for UTF-8 input for that pattern then. – Holger Aug 30 '18 at 08:03
  • 1
    When you replace `filter` with `flatMap` and `find()` with `results()` (Java 9 or newer), you get a stream of `MatchResult` objects, which can report the position and actual match. Line numbers are not available without additional work, as that code doesn’t ever search for line breaks. But when you have the index within the `CharSequence`, you can check how many line breaks are before that position, to get the line number. To get `Path` and `MatchResult`, you would have to `map` to an object holding both (see the sketch after these comments). – Holger Aug 30 '18 at 08:05
  • parallel gave me `28s` and sequential gave me `36s` almost consistently. Also, `UTF-8` works. I had to change `if(cs.newEncoder().maxBytesPerChar() != 1f)` to be `if(cs.newEncoder().maxBytesPerChar() != 3f)`. Maybe you should add that edit to the answer! – Gaurav Goenka Aug 30 '18 at 09:43
  • 1
    No, the test is intentional, as it is known that the code will not handle non-ASCII characters of UTF-8 correctly. Changing the test to `… != 3f` will cause it to reject those charsets, it can handle, for the sake of accepting a charset that it can’t. As said in my previous comment, if you know for sure, that your *pattern* will match ASCII characters only, you can make it work, as it is not required to handle non-ASCII characters correctly then. In that case, simply specify `Charset.forName("US-ASCII")` or `StandardCharsets.US_ASCII`, but keep the test as written in my answer. – Holger Aug 30 '18 at 09:48
  • I used `StandardCharsets.US_ASCII` and that works just fine. – Gaurav Goenka Aug 30 '18 at 10:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179107/discussion-between-gaurav-goenka-and-holger). – Gaurav Goenka Aug 30 '18 at 11:49
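
To make the flatMap/results() comment above concrete, here is a sketch of the Java 9+ variant, assuming the pattern, table and mapAsciiBasedText from this answer are in scope (Map.entry also requires Java 9):

// Sketch (Java 9+): Matcher.results() streams one MatchResult per hit,
// so flatMap yields every match together with its file and offset.
// Requires: import java.util.regex.MatchResult;
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                                .filter(Files::isRegularFile)) {
    List<Map.Entry<Path, MatchResult>> hits = stream
            .flatMap(path -> {
                try {
                    return pattern.matcher(mapAsciiBasedText(path, table)).results()
                                  .map(r -> Map.entry(path, r));
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            })
            .collect(Collectors.toList());
    hits.forEach(e -> System.out.println(
            e.getKey() + " @ " + e.getValue().start() + ": " + e.getValue().group()));
}
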
1

I haven't used Unix4j yet, but Java provides nice file APIs as well nowadays. Also, Unix4j#grep seems to return all the found matches (as you're using .toStringResult().isEmpty()), while you seem to just need to know whether at least one match was found (which means you should be able to break as soon as one match is found). Maybe this library provides another method that could better suit your needs, e.g. something like #contains? Without Unix4j, Stream#anyMatch could be a good candidate here. Here is a vanilla Java solution if you want to compare it with yours:

private boolean lineContainsObviousStrings(String line) {
  return Strings_PASSWORDS // <-- weird naming BTW
    .stream()
    .anyMatch(line::contains);
}

private boolean fileContainsObviousStrings(Path path) {
  try (Stream<String> stream = Files.lines(path)) {
    return stream.anyMatch(this::lineContainsObviousStrings);
  } catch (IOException e) {
    // Files.lines throws a checked IOException, which must be handled here
    throw new UncheckedIOException(e);
  }
}

public List<Path> findFilesContainingObviousStrings() {
  Instant startTime = Instant.now();
  try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))) {
    return stream
      .filter(FilePredicates.isFileAndNotDirectory())
      .filter(this::fileContainsObviousStrings)
      .collect(Collectors.toList());
  } finally {
    Instant endTime = Instant.now();
    System.out.println("Time taken = " + Duration.between(startTime, endTime).getSeconds() + " seconds");
  }
}
sp00m
  • 1
    no idea if this would perform better, but boy it is a LOT cleaner! 1+ – Eugene Aug 28 '18 at 12:50
  • I need to know the names of all files that matched with any word in the regex, not just the first match. `anyMatch` will return on the first match itself, which is not very useful in my case. My code returns all the matches of all the files. – Gaurav Goenka Aug 28 '18 at 13:24
  • @Holger, I may be completely wrong, but doesn't `return Files.walk...` above consume the stream and hence change its state? – Gaurav Goenka Aug 28 '18 at 13:38
  • 3
    @GauravGoenka consuming the stream is not the same as closing the stream. Closing a stream is an action that is only needed for the few streams backed by an underlying resource that must be released. You did it right in the question by using `try (Stream stream = Files.walk(Paths.get(FILES_DIRECTORY)) …) { /* consume the stream */ }`. – Holger Aug 28 '18 at 13:54
  • @sp00m, this always gives a malformed URL exception. I'm not sure I follow what's throwing this. – Gaurav Goenka Aug 28 '18 at 15:10
0

Please try this out too (if it is possible), I am curious how it performs on your files.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Filescan {

    public static void main(String[] args) throws IOException {
        Filescan sc = new Filescan();
        sc.findWords("src/main/resources/files", new String[]{"author", "book"}, true);
    }

    // kind of Tuple/Map.Entry
    static class Pair<K,V>{
        final K key;
        final V value;

        Pair(K key, V value){
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + " " + value;
        }
    }

    public void findWords(String directory, String[] words, boolean ignorecase) throws IOException{

        final String[] searchWords = ignorecase ? toLower(words) : words;

        try (Stream<Path> stream = Files.walk(Paths.get(directory)).filter(Files::isRegularFile)) {
            long startTime = System.nanoTime();
            List<Pair<Path,Map<String, List<Integer>>>> result = stream
                    // you can test it with parallel execution, maybe it is faster
                    .parallel()
                    // searching
                    .map(path -> findWordsInFile(path, searchWords, ignorecase))
                    // filtering out empty optionals
                    .filter(Optional::isPresent)
                    // unwrap optionals
                    .map(Optional::get).collect(Collectors.toList());
            System.out.println("Time taken = " +     TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()
                            - startTime) + " seconds");
            System.out.println("result:");
            result.forEach(System.out::println);
        }
    }

    private String[] toLower(String[] words) {
        String[] ret = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            ret[i] = words[i].toLowerCase();
        }
        return ret;
    }

    private static Optional<Pair<Path,Map<String, List<Integer>>>> findWordsInFile(Path path, String[] words, boolean ignorecase) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path.toFile())))) {
            String line = br.readLine();
            line = ignorecase && line != null ? line.toLowerCase() : line;
            Map<String, List<Integer>> map = new HashMap<>();
            int linecount = 0; // note: line indices recorded in the map are 0-based
            while(line != null){
                for (String word : words) {
                    if(line.contains(word)){
                        if(!map.containsKey(word)){
                            map.put(word, new ArrayList<Integer>());
                        }
                        map.get(word).add(linecount);
                    }
                }
                line = br.readLine();
                line = ignorecase && line != null ? line.toLowerCase() : line;
                linecount++;
            }
            if(map.isEmpty()){
                // returning empty optional when nothing in the map
                return Optional.empty();
            }else{
                // returning a path-map pair with the words and the rows where each word has been found
                return Optional.of(new Pair<Path,Map<String, List<Integer>>>(path, map));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }    
}
m4gic