How to search a word from multiple documents in java?

Question

My actual requirement is to list all the files in a given directory that contain the search phrase textToMatch in a minimal amount of time (about 4-5 seconds), where the number of files could be up to 100000 or more.

I don't want code; I just want the best algorithm for this.

Take a look at https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html — Emil Kantis, Nov 16 '16 at 08:07
Don't think you can achieve this without building up a search index. — Henry, Nov 16 '16 at 08:20
For an idea how a fast algorithm could be implemented, take a look at http://stackoverflow.com/a/12630617/6999902 — , Nov 16 '16 at 11:13

score 1 · Answer 1 · 2016-11-16T08:33:10.463


Since you will have to open every file, you can also use a tool built for this specific task. Use grep:

We have 100000 files to look at.

% ls -l *.txt | wc -l          
100000

They contain Vestibulum.

% grep Vestibulum 1.txt        
Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt.
euismod ultrices facilisis. Vestibulum porta sapien adipiscing augue congue id pretium lectus

Count the files containing Vestibulum, time this.

% time grep -l Vestibulum *.txt | wc -l
100000
grep --color=auto -l Vestibulum *.txt  0,28s user 0,25s system 99% cpu 0,537 total
wc -l  0,00s user 0,01s system 1% cpu 0,537 total

As you can see, this takes only half a second on my machine.

edited Nov 16 '16 at 08:33

answered Nov 16 '16 at 08:22

What is `grep`, and how can I do this using `grep` in Java? Can you give me a full program? – Mickey Patel Nov 16 '16 at 10:00
How can I do this in Java? I am using a Windows system. – Mickey Patel Nov 16 '16 at 10:09

Since the question is tagged as a java question, Mickey Patel is probably just asking for an algorithm description and not for specific operating systems commands and tools. – Costis Aivalis Nov 16 '16 at 11:01
Why reimplement `grep` in Java? – Nov 16 '16 at 11:11
Why the java tag then? Let him explain. – Costis Aivalis Nov 16 '16 at 11:12
Yes, I want to do this in a Java application. I don't want to use any specific tools for this. – Mickey Patel Nov 16 '16 at 11:18
Well, then you have some options: 1) find a library that implements a *fast* algorithm as `grep` does (I know no such library), 2) implement such an algorithm yourself (good luck). – Nov 16 '16 at 11:26

Costis Aivalis · Answer 2 · 2016-11-16T13:29:20.460

Your program must deal with 2 issues:

1. Locating each and every file in each and every subdirectory, and
2. Searching for the phrase you need inside every file.

For 1: You can search the given directory for files either iteratively or recursively, or let existing code do the work for you by using either a FileVisitor (part of java.nio.file since Java 7) or Apache Commons IO.
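
As a rough illustration of the FileVisitor route, here is a minimal sketch. The class name PhraseFileFinder and its method names are hypothetical, and it assumes each file is UTF-8 text small enough to read into memory at once. The content check is a plain String.contains, used only as a placeholder for whatever search routine you choose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class PhraseFileFinder {

    /** Returns every regular file under root whose content contains textToMatch. */
    static List<Path> findFilesContaining(Path root, String textToMatch) throws IOException {
        List<Path> matches = new ArrayList<>();
        // walkFileTree descends into all subdirectories for us.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                if (attrs.isRegularFile() && contains(file, textToMatch)) {
                    matches.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip files that cannot be opened instead of aborting the walk.
                return FileVisitResult.CONTINUE;
            }
        });
        return matches;
    }

    /** Assumption: the file is UTF-8 text and fits in memory. */
    static boolean contains(Path file, String textToMatch) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return content.contains(textToMatch);
    }
}
```

Calling PhraseFileFinder.findFilesContaining(Paths.get("some/dir"), "Vestibulum") would return the matching paths. Whether this meets a 4-5 second budget for 100000 files depends on disk speed and file sizes, so it needs to be measured on the real data.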

For 2: You could use a Java Scanner or implement yourself a very fast algorithm for searching inside files, called the Boyer-Moore algorithm.
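
To give an idea of what implementing it yourself involves, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It uses only the bad-character shift table, not the good-suffix rule of the full algorithm. The class name HorspoolSearch is hypothetical, and the code works on raw bytes, so it assumes the search phrase and the file contents use the same encoding.

```java
import java.util.Arrays;

// Hypothetical helper class: Boyer-Moore-Horspool search over byte arrays.
final class HorspoolSearch {

    /** Returns the index of the first occurrence of pattern in text, or -1 if absent. */
    static int indexOf(byte[] text, byte[] pattern) {
        int n = text.length;
        int m = pattern.length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // shift[b] = how far the window may move when its last byte is b.
        // Bytes that do not occur in the pattern (except as its last byte)
        // allow a jump of the full pattern length.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos; // every byte matched
            }
            // Slide the window based on the byte under the pattern's last position.
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

The speed-up comes from the shift table: on a mismatch the window can often jump by the whole pattern length, so many bytes of the file are never compared at all. A file matches when indexOf returns a value of 0 or more.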
