Fastest way to get a line containing a substring from a file in Python

Question

I have a list of files, and I want to extract the line containing a particular string from each of them. What would be the fastest way to do this? E.g.:

File:

fisrt line 
second line 
some gossip is innate
smush smush
squish bust
although
last line

I want the line containing gossip and therefore should get

some gossip is innate

in return.

I am looking for something in Python that can emulate the performance of grep in a bash shell.

Did you try something? Was it not fast enough? Or what was the problem? — mkrieger1, May 22 '20 at 09:19
Use the `in` operator or a regular expression with word boundaries. The former will be faster while the latter more accurate. — Jan, May 22 '20 at 09:20
@mkrieger1 I opened the file, read all the lines using `readlines`, and then searched for the line containing the string. The approach seemed rather inefficient to me and I am looking for a better one. — fireball.1, May 22 '20 at 09:22
See https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into — mkrieger1, May 22 '20 at 09:23

score 0 · Answer 1 · answered May 22 '20 at 09:27

As for timing issues, familiarize yourself with the timeit module.
For your specific problem, you have two choices: the in operator or a more accurate regex approach:

import re, timeit

content = """
fisrt line 
second line 
some gossip is innate
smush smush
squish bust
although
last line
"""

def only_string_functions():
    return [line for line in content.split("\n") if "gossip" in line]

pattern = re.compile(r'\bgossip\b')
def regex_approach():
    return [line for line in content.split("\n") if pattern.search(line)]

print(timeit.timeit(only_string_functions, number=10**5))
print(timeit.timeit(regex_approach, number=10**5))

Running each function 100,000 times yields, on my MacBook:

0.11374067
0.40804803300000003

So, as expected, the in operator is considerably faster than the regex approach (roughly 3.5 times in this run), but it will also return lines containing something like mygossip, which arguably should not be matched. This may or may not be a problem.
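The timings above run against a string that is already in memory. For real files, the same two checks can be applied while iterating over the open file object, which yields one line at a time and lets you stop at the first hit instead of building a list with `readlines`. The following is a minimal sketch under a few assumptions: the helper name `first_match` and its `whole_word` flag are hypothetical, the files are assumed to be UTF-8 text, and only the first matching line per file is wanted. It is not claimed to match the speed of grep.

```python
import re

def first_match(path, needle, whole_word=False):
    """Return the first line of `path` that contains `needle`, or None.

    `first_match` and `whole_word` are illustrative names, not part of any library.
    """
    # With whole_word=True, use word boundaries as in regex_approach above.
    pattern = re.compile(r"\b" + re.escape(needle) + r"\b") if whole_word else None
    # Assumption: the file is UTF-8 text; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:  # reads lazily, one line at a time
            hit = pattern.search(line) if pattern else needle in line
            if hit:
                return line.rstrip("\n")  # stop at the first matching line
    return None
```

For a list of files (here a hypothetical `filenames` list), this could be used as `{name: first_match(name, "gossip") for name in filenames}`. Whether it is fast enough is best checked with timeit on your own data.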

This approach stores the contents of the file in a variable and processes on them. As mentioned in the question, I am looking for an approach like `grep` from bash where I can save time on reading all of the file and then iterating through the lines in this list of lines. — fireball.1, May 22 '20 at 09:36
