I have developed some working code, but its method is extremely slow: I am searching a huge text file for thousands of strings, using my dictionary keys as the search terms.
Here's my working code...
for root, subFolders, files in os.walk(path):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                print '\tProcessing file: ' + filename
                for line in f:
                    if 'KEY_FRAME details' in line:
                        # Read the five lines after the header; the frame
                        # number is the third field of the last of them.
                        chunk = [next(f) for x in xrange(5)]
                        FRAME = chunk[4].split()[2]  # chunk[5] would raise IndexError
                        framelist.append([FRAME])
            # The 'with' block closes f automatically; no explicit f.close() needed.
            newdict = dict((d[0], d[1:]) for d in framelist)
with open(os.path.join(root, filename), 'r') as f:
    for line in f:
        if any('FRAME = ' + str(i) in line for i in newdict):
            ...do more text processing based on following lines and append to frame key...
The txt file is too large to read into memory, so I scan the same file twice. The first pass collects the frame numbers of interest by looking for the header string 'KEY_FRAME details' and appends each found frame to a list; I then convert that list into a dict keyed by frame number and close the file.
I then re-open the file and, for each line, test whether it contains 'FRAME = ' + str(frame_number) for any of the dict keys. This works, but it is extremely slow.
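Most of the cost is in the `any(...)` test, which compares every line against every key, i.e. O(lines x keys). If the line format is regular enough to parse the frame number out of each line, a set lookup makes the per-line cost constant no matter how many keys there are. A minimal sketch of that idea, assuming lines of the form `FRAME = <number> ...` and string keys (the helper name and sample data here are invented):

```python
def find_frame_lines(lines, frame_keys):
    """Yield lines whose 'FRAME = <n>' number is in frame_keys.

    frame_keys should be a set (or dict) of frame-number strings,
    so each membership test is O(1).
    """
    for line in lines:
        if 'FRAME = ' in line:
            # Token immediately after 'FRAME = '.
            number = line.split('FRAME = ', 1)[1].split()[0]
            if number in frame_keys:
                yield line

keys = {'7', '42'}
log = ['FRAME = 7 start', 'FRAME = 9 skip', 'FRAME = 42 end']
print(list(find_frame_lines(log, keys)))  # ['FRAME = 7 start', 'FRAME = 42 end']
```

Since newdict is already a dict, testing `number in newdict` gives the same O(1) behaviour without building a separate set.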
I had considered toggling between search strings during the initial read, but some of the 'FRAME = ' + str(frame_number) strings appear before the first 'KEY_FRAME details' string in the file.
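One way to keep a single pass despite that ordering problem is to record every `FRAME = ` candidate line as it goes by and filter the collected candidates against the key set only once the end of the file is reached; memory then grows with the number of FRAME lines, not the file size. A sketch under the same line-format assumptions as the snippets above (the function name and sample log are invented):

```python
def single_pass(lines):
    keys = set()        # frame numbers found under KEY_FRAME headers
    candidates = []     # (frame_number, line) pairs, kept until keys are complete
    it = iter(lines)
    for line in it:
        if 'KEY_FRAME details' in line:
            # Frame number: third field of the fifth line after the header.
            chunk = [next(it) for _ in range(5)]
            keys.add(chunk[4].split()[2])
        elif 'FRAME = ' in line:
            number = line.split('FRAME = ', 1)[1].split()[0]
            candidates.append((number, line))
    # Filter only now, so FRAME lines seen before their header are not lost.
    return [line for number, line in candidates if number in keys]

log = [
    'FRAME = 7 early',   # appears before its KEY_FRAME header
    'KEY_FRAME details',
    'a', 'b', 'c', 'd',
    'id x 7',            # fifth line after the header; third field is the frame
    'FRAME = 9 skip',
]
print(single_pass(log))  # ['FRAME = 7 early']
```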
Is there a more efficient method for the above (albeit rudimentary) code?
Thanks for reading.
After looking into a suggested regex solution, I have changed the end of my script to the following...
framelist.append(str(FRAME))
f.seek(0)  # rewind instead of closing and re-opening the file
# Note: this compiled pattern is never used by the loop below, which still
# does plain substring tests. (If it were used, the frame numbers should be
# passed through re.escape() before joining.)
framed = re.compile('|'.join(framelist))
for s in f:
    if any(('FrameNumber = ' + fr) in s for fr in framelist):
        print 'first found'
    if any(('FN = ' + fr) in s for fr in framelist):
        print 'second found'
This latest change has brought a noticeable speed-up (about 3 minutes per log), which is tolerable, though I was hoping for slightly better; then again, I have more than 6000 frame numbers to search for on each run.
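With 6000+ alternatives, a `'|'.join(...)` pattern gets expensive, and the `any(...)` loops still touch every key for every line. An alternative is a single small compiled regex that matches either prefix and captures the digits, followed by a set lookup on the capture; because `\d+` grabs the whole digit run, key '7' will not falsely match frame '71'. A sketch (the two prefixes are taken from the snippet above; the function name and sample data are invented):

```python
import re

# Either prefix, then the frame number; \d+ captures the whole digit run.
FRAME_RE = re.compile(r'(?:FrameNumber|FN) = (\d+)')

def matching_lines(lines, frame_keys):
    """Return lines whose captured frame number is a member of frame_keys."""
    hits = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m and m.group(1) in frame_keys:
            hits.append(line)
    return hits

keys = {'7'}
log = ['FrameNumber = 7 ok', 'FN = 71 no', 'FN = 7 yes', 'noise']
print(matching_lines(log, keys))  # ['FrameNumber = 7 ok', 'FN = 7 yes']
```

This scans each line once with one pattern, so the cost no longer grows with the number of frame numbers.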