
Here is what I want to do: on the one side, I have a text file with ~100,000 string patterns (each string is on its own line), most of them about 40-200 characters long. On the other side, I have ~130,000 files with sizes anywhere from just a few kilobytes up to large files of several hundred megabytes (however, 95% of the files are just a few hundred kB).

Now, I want to match every one of the 130k files against all of the 100k patterns.

Right now I am doing the matching using the .contains() method; here is some example code:

String file = readFile("somefile.pdf"); // see benchmark below
String[] patterns = readFile("patterns.txt").split("\n"); // read 100k patterns into an array
for (int i = 0; i < patterns.length; i++) {
    if (file.contains(patterns[i])) {
        // pattern matched
    } else {
        // pattern not matched
    }
}

I am running this on a rather powerful desktop system (4-core 2.9 GHz, 4 GB memory, SSD) and I get very poor performance:

When somefile.pdf is a 1.2 MB file, a match against all 100k patterns takes ~43 seconds; a 400 kB file takes ~14 seconds; a 50 kB file takes ~2 seconds.

This is way too slow; I need something with 40x-50x the performance. What can I do?

  • What do you mean by 'pattern'? Is it just a string that you want to find in your documents? Or is it more complex, like a regular expression pattern? – isnot2bad Nov 18 '13 at 10:14
  • For clarification: the 130k files are all kinds of files (images, executables, office files... whatever). I read these files as hex. The patterns are also hex patterns. So everything is rather abstract and obfuscated, no natural language or anything. – user3004200 Nov 18 '13 at 10:29

5 Answers


Creating a search index over these 130k files would probably be the most sustainable approach (a rough sketch follows the list below).

A similar question was answered over here: Searching for matches in 3 million text files

Libraries / Tools that are typically used in Java environments:

  • Lucene
  • Solr
  • elasticsearch
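
To give a flavour of the indexing approach, here is a rough Lucene sketch. The directory names are placeholders, the API is written from memory for a recent Lucene version, and it treats file contents as plain text, so it would only fit the asker's hex-encoded data after suitable preprocessing:

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        try (FSDirectory indexDir = FSDirectory.open(Paths.get("index"));
             IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(new StandardAnalyzer()));
             DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("files"))) {
            for (Path file : files) {
                Document doc = new Document();
                doc.add(new TextField("path", file.toString(), Field.Store.YES));
                // Index the file contents once; later queries only hit the index, not the raw files.
                doc.add(new TextField("contents",
                        new String(Files.readAllBytes(file), StandardCharsets.UTF_8), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}

For hex/binary data like the asker describes, a custom analyzer (for example fixed-length n-grams) would be needed instead of StandardAnalyzer.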
– reto

You can introduce some shortcuts if you haven't already.

If a file needs to match all patterns, you can return false as soon as it doesn't match a pattern. Order your patterns to put the ones most likely to fail at the top.
(If, on the other hand, a file needs to match any pattern, you can return true as soon as the first pattern matches. In this case, order your patterns to put the ones most likely to match at the top.)

If you want all files to match all patterns, make sure you load the smallest files first. That way, you process the ones that are the easiest to compare first. You could also try to load them so that you process the ones most likely to fail first, but that seems (to me) harder to do for the files than for the patterns.


Also, make sure you load your patterns only once.
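
To make the shortcut concrete, here is a minimal sketch for the "match any pattern" case (the class name and file handling are mine, not the asker's code): the patterns are loaded and split once, and the loop returns on the first hit.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShortcutMatcher {
    private final String[] patterns;

    public ShortcutMatcher(String patternFile) throws IOException {
        // Load and split the patterns a single time; reuse them for every file.
        String raw = new String(Files.readAllBytes(Paths.get(patternFile)), StandardCharsets.UTF_8);
        this.patterns = raw.split("\n");
    }

    public boolean matchesAny(String fileContent) {
        for (String pattern : patterns) {      // order patterns so the most likely hits come first
            if (fileContent.contains(pattern)) {
                return true;                   // shortcut: stop at the first match
            }
        }
        return false;
    }
}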

– SQB
  • Already did this. In most cases (99%) the file will not match any pattern, so the shortcuts don't really save a lot of time. – user3004200 Nov 18 '13 at 10:30

If you're only comparing words, one possible optimization could be to index the 130,000 files beforehand. In the simplest case this would go something like (pseudo-code):

for file in files
  read file as string
  split string into tokens (e.g. by white-space)
  set = all tokens in a HashSet
  for every pattern in patterns
      if set.contains(pattern)
          // pattern found
      else
          // pattern not found

The problem with your current solution is the use of arrays, which leads to what in database jargon is called a full-table scan; what you really want are sets, or more sophisticated solutions like Apache Lucene.
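
A rough Java version of that pseudo-code, assuming whitespace tokenization (the names are illustrative; the asker's hex data would need a different tokenizer, and this only finds whole tokens, not arbitrary substrings):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenIndexMatcher {
    // Build the per-file index once: every whitespace-separated token goes into a HashSet.
    public static Set<String> indexFile(String fileContent) {
        return new HashSet<>(Arrays.asList(fileContent.split("\\s+")));
    }

    // Each pattern lookup is then O(1) on average instead of a scan over the whole file.
    public static boolean containsPattern(Set<String> tokens, String pattern) {
        return tokens.contains(pattern);
    }
}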

– Matt

You might want to construct a gigantic regex (i.e. Pattern.compile("word1|word2|word3|...")) and check its performance, as it builds an FSA, which should perform much better than the naive approach.

Similar question:

What's the most efficient way to find one of several substrings in Python?
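
A sketch of that idea, treating each pattern as a literal string (the names are mine; whether this actually beats the naive loop depends on the regex engine, see the backtracking caveat in the comments further down):

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CombinedRegexMatcher {
    private final Pattern combined;

    public CombinedRegexMatcher(List<String> patterns) {
        // Escape each literal pattern and join them with '|' into one precompiled regex.
        String alternation = patterns.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.combined = Pattern.compile(alternation);
    }

    public boolean matchesAny(String fileContent) {
        Matcher m = combined.matcher(fileContent);
        return m.find(); // a single find() against the combined pattern replaces 100k separate contains() calls
    }
}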

– Andrey Chaschev

You have a factor of 100,000x sitting in your for loop, where you apply each pattern sequentially.

You need to find a way to compose your 100K independent patterns into one pattern that can be processed efficiently. ".contains" hunts for a specific text string; you can do that with a regex match, too. With a regex matcher, you can combine your individual regex patterns into a large one, precompile that, and apply that one. (See the Java regex docs: http://docs.oracle.com/javase/tutorial/essential/regex/)

This leaves the problem of determining which pattern got hit, if you care. If you examine lexing tools like FLEX, they provide an answer. You give them what amounts to a set of regexes, and they generate a single fast matcher which will tell you which pattern(s) hit.
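
One Java route to such a combined single-pass matcher is an Aho-Corasick automaton, for example via the open-source org.ahocorasick "Trie" library; the sketch below assumes that library's builder/parseText API, so treat the exact method names as an assumption:

import java.util.Collection;
import java.util.List;

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class AhoCorasickMatcher {
    private final Trie trie;

    public AhoCorasickMatcher(List<String> patterns) {
        // Build one automaton over all patterns; construction happens only once.
        var builder = Trie.builder();
        for (String pattern : patterns) {
            builder.addKeyword(pattern);
        }
        this.trie = builder.build();
    }

    public Collection<Emit> findAll(String fileContent) {
        // A single pass over the file reports every pattern that occurs, and where.
        return trie.parseText(fileContent);
    }
}

Each returned Emit should carry the keyword that matched, which also answers the "which pattern got hit" question directly.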

– Ira Baxter
  • This sounds viable. As I wrote in a comment above, 99% of the files will not have any hits. So I don't really need to know right away which pattern was hit; once I have a hit I can use the (slow) .contains method to find out which pattern it was. – user3004200 Nov 18 '13 at 10:32
  • Using Java regexes for this task is a bad idea. Java regexes are processed via backtracking, which means the efficiency is almost the same as looping. Regexes with FLEX or other deterministic-automaton-based regex engines are not affected by this problem. – CoronA Dec 07 '16 at 06:21
  • Java regexes backtrack? Ick. Yes, that means my idea doesn't work in Java, because you don't get the efficiency of an automaton to do the recognition. Good to know, sorry to hear it. OK, OP gets to go get FLEX or the Java equivalent. – Ira Baxter Dec 07 '16 at 10:31
  • @user3004200: [this is a little late...] 99% of your files may not have any hits, but you can't figure that out without opening them and scanning their entire contents. With 100K files you are likely to be forced to read them from the disk; you want a lot of threads (IMHO hundreds, but you can tune this experimentally) to always be trying to open files, so the OS has a whole list of places to fetch and can schedule the disk heads efficiently. Once you fetch the file, you need to scan it efficiently, which is the point of the FSA built by combining all the patterns. – Ira Baxter Dec 07 '16 at 10:35