
What are the "best" ways to search for an occurrence of a string in a large number of text files, using Python?

As I understand it we can use the following:

for fname in files:
    with open(fname) as f:
        for line in f:
            # do stuff

Python buffers the file in chunks under the hood, so the IO penalty is WAY less severe than it looks at first glance. This is my go-to if I only have to read a few files.

But I can also do the following, given a list of files (or os.walk):

for fname in files:
    with open(fname) as f:
        lines = list(f)
    for line in lines:
        # do stuff
    # Or a variation on this

If I have hundreds of files to read I'd like to load them all into memory before scanning them. The logic here is to keep file access time to a minimum (and let the OS do its filesystem magic) and to keep the scanning logic simple, since IO is often the bottleneck. It's obviously going to cost way more memory, but will it improve performance?

Are my assumptions correct here, and/or are there better ways of doing this? If there's no clear answer, what would be the best way to measure this in Python?
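For concreteness, this is roughly the kind of measurement I had in mind (just a sketch; `files` and `needle` are placeholders for my file list and search string):

import time

def scan_streaming(files, needle):
    # iterate each file line by line, counting lines that contain the needle
    hits = 0
    for fname in files:
        with open(fname) as f:
            for line in f:
                if needle in line:
                    hits += 1
    return hits

def scan_preloaded(files, needle):
    # read each file fully into memory first, then scan the cached lines
    hits = 0
    for fname in files:
        with open(fname) as f:
            lines = list(f)
        for line in lines:
            if needle in line:
                hits += 1
    return hits

for func in (scan_streaming, scan_preloaded):
    start = time.perf_counter()
    func(files, needle)
    print(func.__name__, time.perf_counter() - start)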

Envops
  • " but will it improve performance?" Well, that's an *empirical question*. Did you *profile it*? – juanpa.arrivillaga Aug 25 '20 at 08:40
  • This depends on so many factors beyond your control - such as the size of files, file-system and Python caching, memory size, etc. - that you're probably better off just using the straight-forward method and letting the system take care of the rest. If you DO need to squeeze the last ounce of performance from your program, then use profiling, but my feeling is that you'll be wasting more time trying to optimise your program than you stand to gain from an optimal solution. – Mario Camilleri Aug 25 '20 at 08:55

2 Answers

1

Is that premature optimization?

Did you actually profile the whole process? Is there really a need to speed it up? See: https://stackify.com/premature-optimization-evil/

If you really HAVE the need to speed it up, you should consider a threaded approach, since the task is I/O-bound.

One easy way is to use ThreadPoolExecutor, see: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor
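A rough sketch of what I mean (the `files` list and the search string "needle" are just placeholders here):

from concurrent.futures import ThreadPoolExecutor

def file_contains(path, pattern):
    # each worker thread streams one file and reports whether the pattern occurs
    with open(path, errors="ignore") as f:
        return path, any(pattern in line for line in f)

with ThreadPoolExecutor(max_workers=8) as executor:
    for path, found in executor.map(lambda p: file_contains(p, "needle"), files):
        if found:
            print(path)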

Another way (if you are on Linux) is to execute shell commands like 'find', 'grep', etc. - those little C programs are highly optimized and will almost certainly be the fastest solution. You might use Python to wrap those commands.
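For example, something along these lines (assuming GNU grep is available; "needle" and "some_directory" are placeholders, and 'grep -rl' prints only the names of matching files under the given directory):

import subprocess

# return code 1 from grep just means "no match found", so we don't use check=True
result = subprocess.run(
    ["grep", "-rl", "needle", "some_directory"],
    capture_output=True, text=True
)
matching_files = result.stdout.splitlines()
print(matching_files)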

Regexp is not faster, contrary to what @Abdul Rahman Ali stated:

$ python -m timeit '"aaaa" in "bbbaaaaaabbb"'
10000000 loops, best of 3: 0.0767 usec per loop
$ python -m timeit -s 'import re; pattern = re.compile("aaaa")' 'pattern.search("bbbaaaaaabbb")'
1000000 loops, best of 3: 0.356 usec per loop
bitranox
  • I understand multithreading the logic part operating on the data, but I fail to see how it would help with the IO portion. Could you elaborate? Also, using command-line tools is not a generic solution; I want to keep it simple. – Envops Aug 25 '20 at 09:30
  • @Envops - while we wait for the I/O to complete, we can search the strings, or we can start a second I/O operation in another thread. If you read from one single disk, the read operations are of course intrinsically serialized. I would suggest searching for literature about that here on Stack Overflow. And there is nothing wrong with using command-line tools - you can have two different versions, one for Windows, one for Linux. – bitranox Aug 25 '20 at 14:03
0

The best way to search for a pattern in a text is to use Regular Expressions:

import re

list_of_wanted_word = list()
with open('folder.txt') as f:
    for line in f:
        # find the wanted text in the line and extract it
        wanted_word = re.findall('(^[a-z]+)', line)
        for k in wanted_word:  # putting each word in the list
            list_of_wanted_word.append(k)
print(list_of_wanted_word)
Abdul Rahman Ali
  • Have you profiled it? It might not really be faster! Check: https://stackoverflow.com/questions/19911508/python-speed-for-in-vs-regular-expression - everyone there claims that regexp is slower in that case, but again, you need to profile it. – bitranox Aug 25 '20 at 08:55
  • If you specify the words you want to extract or find using regular expressions accurately, the process will be very fast. – Abdul Rahman Ali Aug 25 '20 at 09:01