
Python:

I have a particular word that I am looking for in a large text file containing millions of records.

So actually I want to check whether a particular string is present in the file.

One way I did it is:

with open('ip.log', 'r') as f:
    for line in f:
        if something in line:
            break
    else:
        # the for-loop's else clause runs only when the loop
        # finished without hitting break
        print('Not found')

This works fine for small files, but when the file size increases or the records grow to tens of millions, loading such a big file into memory may not be a feasible solution.

Is there any better way to deal with this problem?

Observations:

  1. If the file is huge, say 1 GB or more, it will slow down the system.
  2. To look for one text, we need to iterate over millions of records each time.
SaiKiran
  • Have you run a benchmark? File objects are generator-like objects, which means they won't get loaded into memory at once! – Mazdak Nov 14 '17 at 06:52
  • Your observation #2 doesn't quite make sense. If you're looking for one text then you iterate over millions of records one time. However, if you _do_ need to search for multiple texts it makes sense to look for them all at once, when practical, rather than performing multiple searches. – PM 2Ring Nov 14 '17 at 06:57
  • Your code is not reading the entire file into memory at once. It processes a single line, then forgets it. – tripleee Nov 14 '17 at 07:00
  • If you are on unix you could delegate the job to `grep` which I'd expect to be faster. – Paul Panzer Nov 14 '17 at 07:07
  • @PaulPanzer Good idea, but not helping if you need, e.g. the content of the following lines as well or, as mentioned above, multiple strings to be searched. – mikuszefski Nov 14 '17 at 07:18
  • @mikuszefski You must have a pretty bad grep if it can't do those things. – PM 2Ring Nov 14 '17 at 07:21
  • @mikuszefski As a matter of fact `grep` can match multiple strings and can output matches in context. (`-F` and `-C` options) – Paul Panzer Nov 14 '17 at 07:26
  • @PaulPanzer Yeap just figuring that out myself...like -A and -B etc.....definitely not using this often enough....Cheers – mikuszefski Nov 14 '17 at 07:29
  • @PaulPanzer ....yep,...got it e.g. `grep -i -A 2 'firststring\|other' bigfile.txt`...does what I mentioned above. Happy that I learned something new....already a good day – mikuszefski Nov 14 '17 at 07:37
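
Following up on the `grep` suggestion in the comments, here is a minimal sketch of delegating the search from Python. It assumes a Unix-like system with grep on the PATH; the function name, file name, and search string are placeholders from the question:

import subprocess

def found_with_grep(needle, path):
    # -q: report only via the exit status, print nothing;
    # -F: treat the needle as a fixed string, not a regex.
    # grep exits with 0 on a match and 1 when nothing matched.
    return subprocess.call(['grep', '-q', '-F', needle, path]) == 0

if found_with_grep('something', 'ip.log'):
    print('Found')
else:
    print('Not found')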

2 Answers


Your worries are unfounded: your code does not load the entire file into memory; try it! And the break in your loop will stop reading the file as soon as it finds the word you search for, so that's not a worry either.

In your code, f is a file object that reads one line at a time when used with a for-loop. If you had written f.readlines() or f.read(), then you would be reading the entire file.

The only potential problem is if your files do not contain newlines (e.g. if they are binary files, or enormous lists of words separated by spaces rather than newlines). In that case, you'd need to read blocks of characters with something like f.read(10000) (and deal with words being broken across blocks). Since your use case involves regular text files, there's no need to worry about that.
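
A minimal sketch of that block-wise approach (the function name, block size, and overlap handling are illustrative, not a fixed recipe):

def found_in_blocks(word, path, block_size=10000):
    # Keep the last len(word) - 1 characters of each block so a word
    # that straddles a block boundary is still found.
    overlap = len(word) - 1
    tail = ''
    with open(path, 'r') as f:
        while True:
            block = f.read(block_size)
            if not block:  # end of file, no match anywhere
                return False
            if word in tail + block:
                return True
            tail = block[-overlap:] if overlap else ''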

alexis

Use the any function. It will stop on the first match, and won't load the whole file into memory. It is pretty efficient.

with open('ip.log', 'r') as f:
    # any() short-circuits, so reading stops at the first matching line
    if any(something in line for line in f):
        print('Found')
    else:
        print('Not found')
Chen A.