
I have developed some working code, but it is extremely slow. I am searching a huge text file for thousands of strings, using my dictionary keys as the search strings.

Here's my working code...

for root, subFolders, files in os.walk(path):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                print '\tProcessing file: ' + filename

                for line in f:
                    if 'KEY_FRAME details' in line:
                        # grab the next 5 lines; the frame number is the
                        # third field on the last of them
                        chunk = [next(f) for x in xrange(5)]
                        FRAME = chunk[4].split()[2]
                        framelist.append([FRAME])

            # frame number -> the remaining collected fields (empty for now)
            newdict = dict([(d[0], d[1:]) for d in framelist])

            # the with block has already closed the file, so re-open it
            # for the second pass
            with open(os.path.join(root, filename), 'r') as f:
                for line in f:
                    if any(['FRAME = '+str(i) in line for i in newdict.keys()]):
                        pass  # ...more text processing based on the following
                              # lines, appended under the matching frame key...

The txt file is too large to read directly into memory, so I make two passes over the same file. The first pass collects the frame numbers of interest, based on the 'KEY_FRAME' header string, into a list; I then convert that list into a dict keyed by frame number and close the file.

I then re-open the file and, for each line, test whether any of the dict keys (frame numbers) appear in it via the new search string 'FRAME = '+str(frame number). The method works, but it is extremely slow.

I had thought of toggling between the two search strings during the initial read, but some of the 'FRAME = '+str(frame number) strings appear before the first 'KEY_FRAME details' string in the file, so a simple toggle won't work.
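For illustration, an untested single-pass sketch along these lines might avoid the second read, assuming the candidate 'FRAME = ' lines (unlike the whole file) fit in memory; `logpath` here is a placeholder for the real file path:

import collections

candidates = collections.defaultdict(list)  # frame number -> candidate lines
keyframes = set()

with open(logpath, 'r') as f:  # logpath is a placeholder
    for line in f:
        if 'KEY_FRAME details' in line:
            # same extraction as above: the frame number is the third
            # field on the fifth line after the header
            chunk = [next(f) for x in xrange(5)]
            keyframes.add(chunk[4].split()[2])
        elif 'FRAME = ' in line:
            tail = line.split('FRAME = ', 1)[1].split()
            if tail:
                candidates[tail[0]].append(line)

# keep only the buffered lines whose frame turned out to be a key frame
frames = dict((k, candidates[k]) for k in keyframes)

I haven't profiled this variant, though.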

Is there a more efficient method for the above (albeit rudimentary) code?

Thanks for reading.

After looking into a suggested regex solution, I have changed the end of my script to the following...

                        framelist.append(str(FRAME))

                # rewind to the start instead of closing and re-opening
                f.seek(0)

                # one escaped alternation of all frame numbers per prefix,
                # compiled once (requires "import re" at the top of the script)
                numbers = '|'.join(re.escape(fr) for fr in framelist)
                first = re.compile(r'FrameNumber = (?:%s)\b' % numbers)
                second = re.compile(r'FN = (?:%s)\b' % numbers)

                for s in f:
                    if first.search(s):
                        print 'first found'

                    if second.search(s):
                        print 'second found'

The latest change has improved the processing speed to around 3 minutes per log, which is tolerable, though I was hoping for slightly better; then again, I have more than 6000 frame numbers to search for each time.
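For what it's worth, one further squeeze would be to merge both prefixes into a single compiled pattern, so each line is scanned once rather than twice. An untested sketch, reusing `framelist` and the open file `f` from above:

import re

# one pattern covering both prefixes; re.escape keeps odd characters safe
numbers = '|'.join(re.escape(fr) for fr in framelist)
pattern = re.compile(r'(?:FrameNumber|FN) = (?:%s)\b' % numbers)

f.seek(0)
for s in f:
    match = pattern.search(s)
    if match:
        print 'found:', match.group(0)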

MikG
  • have you tried to use tuples instead of lists for the list comprehension bits? – markcial Jan 13 '15 at 12:27
  • Do I see it right that chunk has five elements and you want to split the sixth? – Marco de Wit Jan 13 '15 at 13:11
  • If you open a file with `with`, you do not need to close it explicitly. – Marco de Wit Jan 13 '15 at 13:15
  • Shouldn't you empty `framelist` for every file? – Marco de Wit Jan 13 '15 at 13:17
  • @MarcodeWit: Not emptying `framelist` will _certainly_ slow things down. OTOH, the OP may be accumulating details from the files that are being processed. – PM 2Ring Jan 13 '15 at 13:45
  • Hi all, thanks for the feedback. Unfortunately, after trying your suggestions (emptying `framelist`, initialising `keys` outside the loop, and replacing the file re-open with `f.seek(0)`), the script is still slow. Strange that it's so slow, though. – MikG Jan 13 '15 at 15:54
  • @MikG: In that case, maybe you should try out the regex method mentioned in the link in my answer. – PM 2Ring Jan 14 '15 at 13:00

1 Answer


There are a couple of things you could do to improve the speed.

1) Don't close and re-open the file: you can rewind the file pointer to the start with `f.seek(0)`.

2) You are rebuilding your list of search strings for every line you check in the second `for line in f:` loop. You should build it once, outside the loop, e.g.

keys = ['FRAME = '+str(i) for i in newdict.keys()]

or even

keys = ['FRAME = '+i for i in newdict.keys()]

because it looks to me that the keys of `newdict` are already strings.

And then your test would become

if any(key in line for key in keys):

That test is using a generator expression instead of a list comprehension, so any() can bail out as soon as it's found a match, which is generally much faster than building the whole list of matches before testing them.
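If the matching lines always have the fixed form `FRAME = <number>`, you could go a step further and avoid scanning each line against thousands of keys at all: parse the number out of the line and do a single set lookup. A minimal sketch, assuming that exact layout:

wanted = set(newdict)  # frame-number strings; set membership is O(1)

for line in f:
    if 'FRAME = ' in line:
        tail = line.split('FRAME = ', 1)[1].split()
        if tail and tail[0] in wanted:
            pass  # ...text processing for this frame...

This way each line is parsed once, no matter how many frame numbers you are looking for.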

Actually, you may be able to improve on this matching process by using regexes, as illustrated in this answer.
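A minimal sketch of that idea, assuming the keys of `newdict` are plain digit strings (`re.escape` keeps the pattern safe either way):

import re

alternation = '|'.join(re.escape(k) for k in newdict)
pattern = re.compile(r'FRAME = (?:%s)\b' % alternation)

for line in f:
    if pattern.search(line):
        pass  # ...matched one of the frame numbers...

The regex engine scans each line once against the combined pattern, instead of `any()` scanning it up to 6000 times.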

PM 2Ring
  • Hi @PM 2Ring, I managed to use your suggested regex method, or at least I think I did what you suggested. There has been a little improvement in speed; it now takes 4 minutes to process each log file, which is still a bit slower than I was hoping for. I guess it's simply down to the large number of frame numbers I have to search through (6000+), so I'm not sure if anything else could help with the speed. – MikG Jan 16 '15 at 08:46
  • @MikG: Maybe you should post your updated code on [Code Review](http://codereview.stackexchange.com/), as code optimization questions are far more suitable there than they are here. You will need to post a fully working program (with some sample data) that others can run & test, but that doesn't mean you need to post your complete program. Just make a cut-down program that focuses on the code that needs optimizing. – PM 2Ring Jan 16 '15 at 09:16
  • Thanks for the mention of Code Review; I have posted there just now, so fingers crossed. – MikG Jan 21 '15 at 15:43