What are the "best" ways to search for a occurrence of a string in a large number of text files, using python?
As I understand it we can use the following:
for name in files:
    with open(name) as f:
        for line in f:
            # do stuff
            ...
Python buffers the file in chunks under the hood, so the IO penalty is WAY less severe than it looks at first glance. This is my go-to if I only have a few files to read.
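For example, this is roughly what I have in mind for the actual search (needle and files are placeholder names; files could come from os.walk):

needle = "some string"            # the string I'm searching for (placeholder)
files = ["a.txt", "b.txt"]        # list of paths, e.g. gathered via os.walk (placeholder)

for name in files:
    with open(name) as f:         # Python reads the file lazily, in buffered chunks
        for lineno, line in enumerate(f, 1):
            if needle in line:
                print(f"{name}:{lineno}: {line.rstrip()}")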
But I can also do the following in the case of a list of files (or os.walk):
for name in files:
    with open(name) as f:
        lines = list(f)     # read every line into memory while the file is open
    # the file is closed at this point; process the cached lines
    for line in lines:
        # do stuff
        ...
# Or a variation on this
If I have hundreds of files to read, I'd like to load them all into memory before scanning them. The logic here is to keep file access time to a minimum (and let the OS do its filesystem magic) and to keep the processing logic minimal, since IO is often the bottleneck. It's obviously going to cost way more memory, but will it improve performance?
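Concretely, I'm imagining a two-phase version along these lines (again, the names are placeholders, and files is the same list of paths as above):

needle = "some string"      # placeholder search term
contents = {}               # path -> full file text, all held in memory

# Phase 1: read every file completely so each handle is open as briefly as possible
for name in files:
    with open(name) as f:
        contents[name] = f.read()

# Phase 2: scan the in-memory copies with no further IO
for name, text in contents.items():
    if needle in text:
        print(f"found in {name}")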
Are my assumptions correct here, and/or are there better ways of doing this? If there's no clear answer, what would be the best way to measure this in Python?
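For the measurement part, this is roughly how I'd picture timing the two approaches with time.perf_counter (scan_lazy and scan_eager are just stand-ins for the two loops above, and files/needle are the same placeholders):

import time

def scan_lazy(files, needle):
    # line-by-line, relying on Python's internal buffering
    hits = 0
    for name in files:
        with open(name) as f:
            hits += sum(needle in line for line in f)
    return hits

def scan_eager(files, needle):
    # read each file fully into memory first, then scan the cached lines
    hits = 0
    for name in files:
        with open(name) as f:
            lines = f.readlines()
        hits += sum(needle in line for line in lines)
    return hits

# Note: the OS file cache will be warm after the first run, so repeating the
# measurement (or clearing caches between runs) matters for a fair comparison.
for fn in (scan_lazy, scan_eager):
    start = time.perf_counter()
    fn(files, "some string")
    print(fn.__name__, time.perf_counter() - start)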