3

I have a simple requirement: a user enters a bunch of words and the system scans over 3 million text files and finds the files that contain those keywords. What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

I thought of using the Scanner class for this, but I have no idea about performance over such large files. Performance isn't a very high priority, but it should be at an acceptable standard.

Rijo Joseph
  • You might want to consider storing the keywords in a database and using that to find matches. – Jave Nov 13 '13 at 09:54
  • I would seriously consider using a database for this; databases are made to be optimized for performance. Also, you say there are 3 million text files, but later you note `performance over such large files`; do you mean a large number of files here? The `Scanner` approach could possibly work for normal-sized files, but would take a performance hit, I suppose. – skiwi Nov 13 '13 at 10:01
  • There will be 3 million+ files. Each will have about 14,000 words in natural language. – Rijo Joseph Nov 13 '13 at 10:03
  • Why the "without a complex searching/indexing algorithm" ? define complex, specifiy why you have that constraint? And while you're at it, what is 'acceptable standard' performance? Client gets a response within 3 or 4 business days? :D – Nanne Nov 13 '13 at 13:40

5 Answers

6

it should be at an acceptable standard

We don't know what an acceptable standard is. If we're talking about interactive users, there probably won't be a simple solution that scans 3 million files and returns something within, let's say, 5 seconds.

A reasonable solution would be a search index, potentially based on Lucene.

The major problem with a scanner/grep/find-based solution is that it is slow, won't scale, and that the expensive scanning work has to be done over and over (unless you store intermediate results... but that would not be simple and would basically be a labor-intensive re-implementation of an indexer). When working with an index, only the creation and updates of the index are expensive; queries are cheap.
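
For illustration, here is a rough sketch of that index-then-query split using Lucene. This is a minimal sketch, not a drop-in implementation: the class name KeywordSearch and the "index-dir" path are placeholders, and the exact constructors differ slightly between Lucene releases.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class KeywordSearch {

    // Build the index once; this is the expensive part (plus incremental updates later).
    static void buildIndex(List<Path> textFiles) throws Exception {
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index-dir")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            for (Path file : textFiles) {
                Document doc = new Document();
                doc.add(new StringField("path", file.toString(), Field.Store.YES));
                doc.add(new TextField("contents", new String(Files.readAllBytes(file)), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }

    // Each user query only hits the index; it never re-reads the 3 million files.
    static void search(String userKeywords) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index-dir")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", new StandardAnalyzer()).parse(userKeywords);
            for (ScoreDoc hit : searcher.search(query, 100).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}

Note that the classic QueryParser treats multiple keywords as OR by default; if every keyword must be present, prefix the terms with + or build a BooleanQuery with MUST clauses.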

reto
0

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

A complex searching/indexing algorithm. There's no need to reinvent the wheel here. Since the user can enter any words, you can't get away with a simple preprocessing step; rather, you have to index all the words in the text. This is what something like Lucene does for you.

There is no fast way to search through text other than preprocessing it and building an index. You can roll your own solution for this, or you can just use Lucene.

Naïve text search with no preprocessing will be far, far too slow to be of any use.

Markus Koivisto
0

Why don't you wrap a system call to grep? You can achieve that through the Runtime class.
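
If you go that route, a minimal sketch might look like the following. It assumes grep is installed and on the PATH; the keyword and the /data/texts directory are just placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GrepWrapper {
    public static void main(String[] args) throws Exception {
        String keyword = "example";                 // the user's keyword (placeholder)
        String dir = "/data/texts";                 // root directory of the text files (placeholder)
        // -r: recurse into the directory, -l: print only file names, -F: match a fixed string
        Process p = Runtime.getRuntime()
                .exec(new String[] { "grep", "-r", "-l", "-F", keyword, dir });
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String matchingFile;
            while ((matchingFile = out.readLine()) != null) {
                System.out.println(matchingFile);   // one matching file per line
            }
        }
        p.waitFor();                                // 0 = matches, 1 = no matches, > 1 = error
    }
}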

benjamin.d
  • Grep may not be available on the system, as is the case if you're deploying on Windows. Grep will also be slower than a proper index. – Markus Koivisto Nov 13 '13 at 10:01
  • It is true that an index will be faster. But grep is also available on Windows, and if performance is not a high priority it will mean much faster development. And you can still parallelize your grep calls. – benjamin.d Nov 13 '13 at 10:05
  • Grep is available on Windows only if you go out of your way to install it. You can't count on it being available on your target system unless you are actually responsible for that system. – Markus Koivisto Nov 13 '13 at 16:02
0

When parsing each text file, I would use a BufferedReader and check each line of text for a match.

import java.io.BufferedReader;
import java.io.FileReader;

// try-with-resources makes sure the reader is closed even if an exception occurs
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
   String line;
   while ((line = br.readLine()) != null) {
      // Does this line contain the text?
      if (line.contains(text)) {
         System.out.println("Text found");
         break; // no need to read the rest of the file
      }
   }
}

I'm not sure if this would be very fast for such a huge number of files, though.

dARKpRINCE
0

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?

If you don't use any kind of indexing algorithm, then each time you submit a search you will need to read every file. The overhead in doing so doesn't lie in the 'matching' algorithm but in the I/O latency. So I wouldn't worry too much about what to use for matching; Scanner is a straightforward choice.
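
For example, a single file can be checked for a literal keyword with Scanner.findWithinHorizon. This is just a sketch: containsKeyword is an illustrative helper name, and a horizon of 0 means the whole input is searched.

import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerMatch {
    // True if the file contains the keyword as a literal string.
    static boolean containsKeyword(File file, String keyword) throws IOException {
        try (Scanner scanner = new Scanner(file)) {
            return scanner.findWithinHorizon(Pattern.quote(keyword), 0) != null;
        }
    }
}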

If you want to increase performance, you will need some sort of pre-processing. You could load the files into memory, size permitting, or build a set of words per file (an index), as sketched below. There are plenty of algorithms out there for you to look at, especially the 'word count' examples in Map/Reduce contexts. You might also want to look into Java's Fork/Join framework if you want higher concurrency.
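
As a toy illustration of the 'set of words per file' idea, here is a purely in-memory sketch (SimpleWordIndex is a made-up name). With 3 million files of roughly 14,000 words each this map would likely not fit in memory, which is why a proper on-disk index such as Lucene is usually the better fit.

import java.io.IOException;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

public class SimpleWordIndex {
    // word -> files containing that word; built once, queried many times
    private final Map<String, Set<Path>> index = new HashMap<>();

    // Tokenizes on whitespace only; good enough for a sketch.
    void addFile(Path file) throws IOException {
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase();
                index.computeIfAbsent(word, w -> new HashSet<>()).add(file);
            }
        }
    }

    // Files containing ALL of the given keywords (intersection of the per-word sets).
    Set<Path> search(Collection<String> keywords) {
        Set<Path> result = null;
        for (String keyword : keywords) {
            Set<Path> files = index.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(files);
            } else {
                result.retainAll(files);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}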

Ioannis Deligiannis