I have a regex pattern made up of words, like welcome1|welcome2|changeme,
which I need to search for in thousands of files (anywhere between 100 and 8000), ranging in size from 1 KB to 24 MB each.
I would like to know if there's a faster way of pattern matching than what I have been trying.
Environment:
- jdk 1.8
- Windows 10
- Unix4j Library
Here's what I have tried so far:
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
        .filter(FilePredicates.isFileAndNotDirectory())) {

    // Wrap each word in ".*" because Unix4j apparently needs this
    List<String> obviousStringsList = Strings_PASSWORDS.stream()
            .map(s -> ".*" + s + ".*")
            .collect(Collectors.toList());
    Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));

    GrepOptions options = new GrepOptions.Default(GrepOption.count,
            GrepOption.ignoreCase,
            GrepOption.lineNumber,
            GrepOption.matchingFiles);

    // Time only the grep pass over the files
    Instant startTime = Instant.now();
    final List<Path> filesWithObviousStrings = stream
            .filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
            .collect(Collectors.toList());
    System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}
I get Time taken = 60 seconds
which makes me think I'm doing something really wrong.
I've tried different ways of using the stream, and on average every approach takes about a minute to process my current folder of 6660 files.
By comparison, grep on MSYS2/mingw64 takes about 15 seconds, and exec('grep...')
in node.js consistently takes about 12 seconds.
I chose Unix4j because it provides a grep implemented natively in Java and keeps the code clean.
Is there a way to get better results in Java that I'm missing?
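
For comparison, here is a minimal sketch of a baseline that uses only java.util.regex and java.nio, with no Unix4j involved. It has not been timed against the numbers above. The class name PlainRegexScan and its method names are hypothetical, and two assumptions are baked in: the words are treated as literals (hence Pattern.quote), and the files are decoded as ISO-8859-1 so that arbitrary bytes never cause a decoding error. Because Matcher.find() searches anywhere in a line, the ".*" wrappers are not needed here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; compiles on JDK 1.8.
final class PlainRegexScan {

    // Builds one case-insensitive alternation such as welcome1|welcome2|changeme.
    // Assumption: each word is a literal. Drop Pattern.quote if the words are regexes.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
    }

    // Returns true at the first matching line, so the rest of the file is not read.
    static boolean containsMatch(Path file, Pattern pattern) {
        // Assumption: ISO-8859-1 maps every byte to a char, so decoding cannot fail.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Collects the regular files under root first, then scans them in parallel.
    static List<Path> filesWithMatches(Path root, Pattern pattern) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
            return files.parallelStream()
                    .filter(file -> containsMatch(file, pattern))
                    .collect(Collectors.toList());
        }
    }
}
```

Calling PlainRegexScan.filesWithMatches(Paths.get(FILES_DIRECTORY), PlainRegexScan.buildPattern(Strings_PASSWORDS)) would be the rough equivalent of the Unix4j pipeline above, minus the per-line counts and line numbers that the grep options request. Whether it actually runs faster on this data set would have to be measured.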