I need to find a pattern in a small text file, so loading the entire file into RAM isn't a concern for me, as advised here:

I tried to do it in two ways:

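import re

with open(inputFile, 'r') as file:
    for line in file.readlines():
        for date in dateList:
            # Raw string so that \d reaches the regex engine unchanged
            if re.search(r'{} \d* 1'.format(date), line):
                ...  # handle the match (body omitted in the question)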

OR

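with open(inputFile, 'r') as file:
    contents = file.read()
    for date in dateList:
        # Raw string so that \d reaches the regex engine unchanged
        if re.search(r'{} \d* 1'.format(date), contents):
            ...  # handle the match (body omitted in the question)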

The second one proved to be much faster.

Is there an explanation for this, other than the fact that I am using one less loop with the second approach?

Ghost Ops
  • It is clear that you perform a single match per whole file text in the second case (`re.search` only looks for the first match). In the first case, you run it as many times as there are lines in a file. The first code snippet is bound to take more time. – Wiktor Stribiżew Oct 06 '21 at 08:43
  • Correct. I missed the multiple calls - Thanks – Ran Elkayam Oct 06 '21 at 08:51

1 Answer

As pointed out in the comments, the two snippets are not equivalent: the second one only looks for the first match in the whole file, whereas the first one runs a search on every line. Besides this, the first is also more expensive because the (relatively expensive) `format` call is repeated for every date on each line. Building the regexps once and precompiling them with `re.compile` should help a lot. Even better: you can generate a single regexp that matches all the dates at once using something like:

# Raw string so that \d reaches the regex engine unchanged
regexp = r'({}) \d* 1'.format('|'.join(str(date) for date in dateList))

with open(inputFile, 'r') as file:
    contents = file.read()
    # Search for the first match of any date in dateList
    if re.search(regexp, contents):
        ...  # handle the match (body omitted here)

Note that you can use `re.findall` if you want all of the matches rather than only the first one.
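
As a rough sketch of how the precompiled, combined pattern could be used to collect every match, here is a small self-contained example. The sample dates and file contents are made up for illustration (the real `dateList` and input file are not shown in the question), and the `re.escape` call is an extra precaution I am assuming is wanted in case the date strings contain regex metacharacters.

```python
import re

# Hypothetical sample data standing in for dateList and the file contents.
date_list = ["2021-10-05", "2021-10-06"]
contents = (
    "2021-10-05 123 1\n"
    "2021-10-06 42 0\n"
    "2021-10-06 7 1\n"
)

# Build one alternation over all dates and compile it once.
# re.escape (an added precaution) makes each date match literally.
pattern = re.compile(
    r"({}) \d* 1".format("|".join(re.escape(str(d)) for d in date_list))
)

# With a single capturing group, findall returns only the captured dates.
matched_dates = pattern.findall(contents)

# finditer yields match objects, which give access to the full matched text.
full_matches = [m.group(0) for m in pattern.finditer(contents)]

print(matched_dates)
print(full_matches)
```

Because the pattern contains one capturing group, `findall` returns the date part of each match, not the whole matched text; `finditer` with `m.group(0)` is one way to get the latter.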

Jérôme Richard