Here is what I want to do: on one side, I have a text file with ~100,000 string patterns (one string per line), most of them about 40-200 characters long. On the other side, I have ~130,000 files ranging in size from just a few kilobytes up to several hundred megabytes (however, 95% of the files are only a few hundred kB).
Now, I want to match every one of the 130k files against all of the 100k patterns.
Right now I am doing the matching using the .contains() method; here is some example code:
String file = readFile("somefile.pdf"); // see benchmark below
String[] patterns = readFile("patterns.txt").split("\n"); // read 100k patterns into an array
for (int i = 0; i < patterns.length; i++) {
    if (file.contains(patterns[i])) {
        // pattern matched
    } else {
        // pattern not matched
    }
}
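(readFile above is not a built-in Java method; for completeness, a minimal sketch of what such a helper could look like, using java.nio.file.Files. The name and signature are assumptions to make the snippet self-contained:)

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// hypothetical helper assumed by the snippet above: reads a whole
// file into memory as a single String (lossy for binary formats
// like PDF, but that matches how the matching code treats files)
static String readFile(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    return new String(bytes, StandardCharsets.UTF_8);
}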
I am running this on a rather powerful desktop system (4-core 2.9 GHz CPU, 4 GB RAM, SSD) and I get very poor performance:
When somefile.pdf is a 1.2 MB file, a match against all 100k patterns takes ~43 seconds; a 400 kB file takes ~14 seconds, and a 50 kB file ~2 seconds.
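(For reference, per-file numbers like these can be reproduced with a simple wall-clock measurement around the matching loop; a sketch using the standard System.nanoTime():)

long start = System.nanoTime();
int hits = 0;
for (int i = 0; i < patterns.length; i++) {
    if (file.contains(patterns[i])) {
        hits++; // count matches so the work cannot be optimized away
    }
}
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println(hits + " matches in " + elapsedMs + " ms");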
This is way too slow; I need roughly 40-50x the performance. What can I do?