
I am trying to search a large group of text files (160K) for a specific string that changes for each file. I have a text file that lists every file in the directory along with the string value I want to search for in it. Basically, I want to use Python to create a new text file that gives the file name, the string, and a 1 if the string is present or a 0 if it is not.

The approach I am using so far is to create a dictionary from a text file. From there I am stuck. Here is what I figure in pseudo-code:

**assign dictionary**
d = {}
with open('file.txt') as f:
  d = dict(x.rstrip().split(None, 1) for x in f)
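For example, if file.txt contains lines like the following (the names and strings here are made up):

file1.txt widget
file2.txt sprocket

then split(None, 1) turns them into d == {"file1.txt": "widget", "file2.txt": "sprocket"}.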

**loop through directory**
for filename in os.listdir(os.getcwd()):

***here is where I get lost***
match file name to dictionary
look for string
write filename, string, 1 if found
write filename, string, 0 if not found

Thank you. It needs to be somewhat efficient since it's a large amount of text to go through.

Here is what I ended up with:

import os

# map each file name to the string to search for in that file
with open('ibes.txt') as f:
    d = dict(x.rstrip().split(None, 1) for x in f)

for filename in os.listdir(os.getcwd()):
    # sentinel that should never match, for files not listed in ibes.txt
    string = d.get(filename, "!@#$%^&*")
    if string in open(filename, 'r').read():
        with open("ibes_in.txt", 'a') as out:
            out.write("{} {} {}\n".format(filename, string, 1))
    else:
        with open("ibes_in.txt", 'a') as out:
            out.write("{} {} {}\n".format(filename, string, 0))

1 Answer

As I understand your question, the dictionary relates file names to strings:

d = {
    "file1.txt": "widget",
    "file2.txt": "sprocket",  # etc.
}

If each file is not too large, you can read each one into memory:

import os

for filename in os.listdir(os.getcwd()):
    string = d[filename]                      # the string to look for in this file
    if string in open(filename, 'r').read():  # read the whole file and test membership
        print(filename, string, "1")
    else:
        print(filename, string, "0")

This example uses print, but you could write to a file instead: open the output file before the loop with `outfile = open("outfile.txt", 'w')` and, instead of printing, use

outfile.write("{} {} {}\n".format(filename, string, 1))
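
Putting those pieces together, a minimal sketch (assuming the dictionary d has already been built as above, and using the outfile.txt name from the previous paragraph) looks like:

import os

with open("outfile.txt", 'w') as outfile:
    for filename in os.listdir(os.getcwd()):
        string = d[filename]
        # 1 if the string occurs anywhere in the file, 0 otherwise
        found = 1 if string in open(filename, 'r').read() else 0
        outfile.write("{} {} {}\n".format(filename, string, found))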

On the other hand, if each file is too large to fit easily into memory, you could use mmap, as described in Search for string in txt file Python.
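
A rough sketch of the mmap approach (an illustration under the same assumptions as above, not the code from the linked answer; note that mmap works on bytes, so the search string has to be encoded):

import mmap
import os

for filename in os.listdir(os.getcwd()):
    string = d[filename]
    with open(filename, 'rb') as f:
        if os.path.getsize(filename) == 0:
            found = 0  # mmap cannot map an empty file
        else:
            # map the file instead of reading it all at once; pages are faulted in as find() scans
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                found = 1 if mm.find(string.encode()) != -1 else 0
    print(filename, string, found)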

– James K
  • I ran this and got this error: string = dict[filename] TypeError: 'type' object is not subscriptable – prizmracer11 Sep 12 '16 at 19:06
  • That's because one should never use `dict` as the name of a dict. – James K Sep 12 '16 at 19:09
  • OK, fixed that. I had to change the line for missing keys since I had copy-and-pasted the dict; I redid it using `d.get(filename, "!@#$%^&*")` (the random string is just a silly way to mark missing keys). I also had to add () to the read, i.e. `if string in open(filename, 'r').read()`, to get it to run. – prizmracer11 Sep 12 '16 at 19:26
  • An `mmap` solution is by far the best if: 1) The files are smaller than a GB or so _or_ you're using a 64 bit build of Python (in which case file size doesn't matter) and 2) The files are large, and the search string is often found early in the file. If #2 is false, `mmap` won't hurt, but it won't help much either. If #2 is true, then you might short-circuit quite a lot of I/O. On Py3, you might also benefit from [using `os.posix_fadvise`](https://docs.python.org/3/library/os.html#os.posix_fadvise) on the file object's `fileno()` (passing appropriate `WILLNEED`, `SEQUENTIAL` or `NOREUSE`). – ShadowRanger Sep 12 '16 at 19:47
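
For reference, a minimal sketch of the posix_fadvise hint mentioned above (Unix-only, Python 3.3+; the file name here is illustrative):

import os

with open("some_large_file.txt", 'rb') as f:
    # advise the kernel that this file will be read sequentially
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    data = f.read()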