
Given a list of documents, each containing strings, how would you search through the documents and return the list of documents that contain the string you were searching for?

How would I go about implementing a program in Python or C for this problem? I've considered grep, but I'm not sure how to implement something like it inside a native Python/C application.

My thought process at the moment is simply to loop through the documents, then scan through all the strings in each one, but that seems a little inefficient.

Any help appreciated.

wowdavers
    You mean, the equivalent of `grep -le pattern document1 document2 .. documentN`? – Nominal Animal Oct 30 '16 at 19:38
  • @NominalAnimal Yeah, I think so. I've only used grep once or twice, but I'm looking to implement with Python or C. – wowdavers Oct 30 '16 at 21:24
  • Both [C](http://man7.org/linux/man-pages/man3/popen.3.html) and Python [2](https://docs.python.org/2/library/subprocess.html) and [3](https://docs.python.org/3/library/subprocess.html) support `popen()`. With it, you can run a shell command like the `grep` mentioned above, and read the results as `grep` produces them, without using a temporary file or other such mess. – Nominal Animal Oct 30 '16 at 22:02
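For instance, a minimal sketch of that subprocess route in Python (`pattern` and `files` here are placeholders, and grep is assumed to be on the PATH):

import subprocess

# Run `grep -l` and read the names of the matching files from its stdout.
try:
    out = subprocess.check_output(['grep', '-l', '-e', pattern] + files)
    matching = out.decode().splitlines()
except subprocess.CalledProcessError:
    matching = []  # grep exits with status 1 when nothing matches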

1 Answer


The simple solution is just as you stated: loop through the files and search through each one.

Naive approach

for filename in files:
    with open(filename) as f:
        for line in f:
            if pattern in line:
                print(filename)  # prints once per matching line

If you wanted to do a little better, you could bail out of a file as soon as you find a match.

Slightly better

for filename in files:
    with open(filename) as f:
        for line in f:
            if pattern in line:
                print(filename)
                break  # found what we were looking for; continue to next file

At this point you could attempt to distribute the problem across multiple threads. You will probably be I/O bound, though, and may even see worse performance, because multiple threads end up trying to read different parts of the disk at the same time.

Threaded approach

import threading

matches = []  # list.append is atomic under CPython's GIL

def search(filename):
    with open(filename) as f:
        for line in f:
            if pattern in line:
                matches.append(filename)
                break  # found what we were looking for

threads = [threading.Thread(target=search, args=(name,)) for name in files]
for t in threads: t.start()
for t in threads: t.join()  # wait for all to finish, then display matches

But if you are concerned about performance, you should either use grep or copy how it works. It saves time by reading the files as raw binary (rather than breaking them up line by line) and by using the Boyer–Moore string-searching algorithm. Refer to this other SO question about how grep runs fast.
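As a rough sketch of that idea (the helper name and chunk size are illustrative, not grep's actual implementation), you can scan each file as raw bytes in large chunks, keeping a small overlap so a match straddling a chunk boundary isn't missed; the `in` operator on bytes already uses a fast substring-search algorithm internally:

def file_contains(filename, pattern, chunk_size=1 << 20):
    # Scan the file as raw bytes in big chunks instead of line by line.
    # Keep the last len(pattern) - 1 bytes so a match that straddles
    # two chunks is still found.
    overlap = len(pattern) - 1
    with open(filename, 'rb') as f:
        tail = b''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if pattern in tail + chunk:
                return True
            tail = chunk[-overlap:] if overlap else b''

for filename in files:
    if file_contains(filename, b'needle'):
        print(filename)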

Probably What You Want™ approach

grep -l pattern files
  • You don't have to read the files line by line; you can read the entire file and search it with a regular expression. – wwii Oct 30 '16 at 22:19
  • @wwii What's the syntax for reading the entire file? I've only been able to find approaches that read line by line. – wowdavers Oct 30 '16 at 22:46
  • @wowdavers [```with open('file') as f: s = f.read()```](https://docs.python.org/3/library/io.html#io.TextIOBase.read) – wwii Oct 30 '16 at 22:58
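For example, a minimal sketch of that whole-file suggestion applied to the problem above (`pattern` and `files` are placeholders):

import re

regex = re.compile(pattern)
for filename in files:
    with open(filename) as f:
        if regex.search(f.read()):  # read the whole file once, search it once
            print(filename)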