I want this code to run faster.

import re
with open('largetextfile.txt') as f:
    for line in f:
        pattern = re.compile("^1234567")
        if pattern.match(line):
            print (line)

This takes 19 seconds.

I modified it:

import re
with open('largetextfile.txt') as f:
    for line in f:
        if "1234567" in line:
            pattern = re.compile("^1234567")
            if pattern.match(line):
                print (line)

This takes 7 seconds.

So the question is: is there a better way?

I got two ideas from the community and, based on them, asked a more detailed question at: https://codereview.stackexchange.com/questions/135159/python-search-for-array-in-large-text-file

Rahul

3 Answers

Check if this matches your requirement:

with open('largetextfile.txt') as f:
    for line in f:
        if line.startswith('1234567'):
            print(line)
shiva

Since you're matching a plain string, you don't need regular expressions. You can use this:

with open('bigfile.txt') as f:
    for line in f:
        if line[:7] == "1234567":
            print (line)

I noticed that string slicing is slightly faster than startswith, and found that this has been discussed here
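
One way to check the slicing versus startswith claim on your own machine is a small timeit comparison. This is only a sketch: the sample lines and the names count_startswith and count_slice are made up for illustration, and the result may differ between Python versions and inputs.

```python
import timeit

PREFIX = "1234567"

# Hypothetical sample data, not taken from the original file.
lines = ["1234567 matching line\n", "some other line of text\n"] * 1000


def count_startswith(lines):
    """Count lines that begin with PREFIX using str.startswith."""
    return sum(1 for line in lines if line.startswith(PREFIX))


def count_slice(lines):
    """Count lines that begin with PREFIX by comparing a slice."""
    n = len(PREFIX)
    return sum(1 for line in lines if line[:n] == PREFIX)


if __name__ == "__main__":
    for fn in (count_startswith, count_slice):
        elapsed = timeit.timeit(lambda: fn(lines), number=100)
        print(fn.__name__, round(elapsed, 4))
```

Both functions must return the same count; only the elapsed times are of interest.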

user2314737

To run tests, I copied the following text (6.31 MB, around 128,000 lines) into a file AAA.txt:
http://norvig.com/big.txt
Then, with the help of the random module, I turned it into a file BBB.txt by inserting '1234567' at the start of 1000 randomly chosen lines.
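
The exact script used to build BBB.txt is not shown, so the following is only a guess at how such a file could be produced. The function names insert_prefix and make_test_file, and the seed parameter, are hypothetical.

```python
import random


def insert_prefix(lines, prefix="1234567", count=1000, seed=None):
    """Return a copy of lines with prefix added to `count` random lines."""
    rng = random.Random(seed)
    # Pick distinct line indexes; never ask for more than the file has.
    chosen = set(rng.sample(range(len(lines)), min(count, len(lines))))
    return [prefix + line if i in chosen else line
            for i, line in enumerate(lines)]


def make_test_file(src_path, dst_path, prefix="1234567", count=1000):
    """Read src_path, prefix random lines, and write the result to dst_path."""
    with open(src_path) as src:
        lines = src.readlines()
    with open(dst_path, "w") as dst:
        dst.writelines(insert_prefix(lines, prefix, count))
```

Under these assumptions, make_test_file('AAA.txt', 'BBB.txt') would produce a file like the one described.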

I tested several solutions on this modified text.

I can't tell which of the following is the fastest, but I think they are all faster than the other solutions on this page and than the other solutions I tried.

They rely on the fact that the substring test 'string' in 'anotherstring' is very fast.

def in_and_startswith(x):
    return '1234567' in x and x.startswith('1234567')
with open('BBB.txt') as f:
    for line in filter(in_and_startswith, f):
        x=0

.

def in_and_find(x):
    return '1234567' in x and x.find('1234567')==0
with open('BBB.txt') as f:
    for line in filter(in_and_find, f):
        x=0

.

def just_in(x):
    return '1234567' in x

with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.startswith('1234567'):
            x=0

with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.find('1234567')==0:
            x=0
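
To compare these variants, a small timeit harness along the following lines could be used. It is a sketch, not the measurement code behind this answer: only_startswith, count_matches and time_predicates are hypothetical names, and the lines are assumed to be already in memory so that disk reads do not dominate the timings.

```python
import timeit

PREFIX = "1234567"


def in_and_startswith(x):
    return PREFIX in x and x.startswith(PREFIX)


def in_and_find(x):
    return PREFIX in x and x.find(PREFIX) == 0


def only_startswith(x):
    # Baseline without the preliminary "in" test.
    return x.startswith(PREFIX)


def count_matches(predicate, lines):
    """Consume the filter completely and count the matching lines."""
    return sum(1 for _ in filter(predicate, lines))


def time_predicates(lines, repeat=5):
    """Return the best time in seconds of `repeat` runs for each predicate."""
    results = {}
    for pred in (in_and_startswith, in_and_find, only_startswith):
        timings = timeit.repeat(lambda: count_matches(pred, lines),
                                number=1, repeat=repeat)
        results[pred.__name__] = min(timings)
    return results
```

For example, reading the file with readlines() and passing the list to time_predicates gives one timing per predicate to compare.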

Note that I tested with a dummy statement, x=0, instead of print(line), because print() is slow and would dominate the timings. For the same reason, calling print() many times takes much longer than printing a single string built by joining all the strings to be printed.

Test the execution times of

for x in ['hkjh','kjhoi','3135487j','kjhskdkfh','54545779']:
    print(x)

and

print('\n'.join(x for x in ['hkjh','kjhoi','3135487j','kjhskdkfh','54545779']))

You'll see the difference.
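
Applied to the original problem, the same idea might look like the sketch below: collect the matching lines first and write them in one call. The names matching_lines and print_matches are hypothetical, and the sketch assumes all matching lines fit in memory.

```python
import sys


def matching_lines(lines, prefix="1234567"):
    """Return the lines that start with prefix, newlines included."""
    return [line for line in lines if line.startswith(prefix)]


def print_matches(path, prefix="1234567", out=sys.stdout):
    """Write every matching line of the file at path with a single write."""
    with open(path) as f:
        # Lines keep their trailing newline, so join with an empty string.
        out.write("".join(matching_lines(f, prefix)))
```

Whether this beats printing line by line depends on how many lines match and on where the output goes, so it is worth timing both.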

eyquem