0

My actual requirement is to list all the files in the given directory which contains the search phrase textToMatch in minimum amount of time about 4-5 seconds, where number of files could be upto 100000 or more.

I don't want code, just I want a best algorithm for this.

Mickey Patel
  • 501
  • 1
  • 4
  • 16
  • Take a look at https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html – Emil Kantis Nov 16 '16 at 08:07
  • Use `grep -l textToMatch *.txt`. –  Nov 16 '16 at 08:13
  • Don't think you can achieve this without building up a search index. – Henry Nov 16 '16 at 08:20
  • You *can* do this by using the proper tool. –  Nov 16 '16 at 08:24
  • For an idea how a fast algorithm could be implemented, take a look at http://stackoverflow.com/a/12630617/6999902 –  Nov 16 '16 at 11:13

2 Answers2

1

Since you will have to open every file, you can also use a tool build for this specific task. Use grep:

We have 100000 files to look at.

% ls -l *.txt | wc -l          
100000

They contain Vestibulum.

% grep Vestibulum 1.txt        
Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt.
euismod ultrices facilisis. Vestibulum porta sapien adipiscing augue congue id pretium lectus

Count the files containing Vestibulum, time this.

% time grep -l Vestibulum *.txt | wc -l
100000
grep --color=auto -l Vestibulum *.txt  0,28s user 0,25s system 99% cpu 0,537 total
wc -l  0,00s user 0,01s system 1% cpu 0,537 total

As you see, this takes only have a second on my machine.

0

Your program must deal with 2 issues:

  1. Locating each and every file in each and every subdirectory and
  2. Searching for the phrase you need inside every file.

For 1: You can search the given directory for files either iteratively or recursively or let Java 7 or 8 do the work for you by using either a FileVisitor or Apache Commons IO.

For 2: You could use a Java Scanner or implement your self the very fast algorithm for searching inside files, called the Boyer-Moore algorithm.

Costis Aivalis
  • 13,680
  • 3
  • 46
  • 47