superfast regexmatch in large text file

Question

I want this code to run faster.

import re
with open('largetextfile.txt') as f:
    for line in f:
        pattern = re.compile("^1234567")
        if pattern.match(line):
            print (line)

This takes 19 seconds.

I modified it:

import re
with open('largetextfile.txt') as f:
    for line in f:
        if "1234567" in line:
            pattern = re.compile("^1234567")
            if pattern.match(line):
                print (line)

This takes 7 seconds.

So the question is, is there any better way?

I got two ideas from the community and, based on those, asked a more detailed question at: https://codereview.stackexchange.com/questions/135159/python-search-for-array-in-large-text-file

Small thing to change is to take the pattern definition out of the loop — user2314737, Jul 16 '16 at 12:04
Maybe instead of checking `if "1234567" in line:` check if the first 7 characters in line are equal to "1234567" as a string (no `in`). — user2314737, Jul 16 '16 at 12:09
You are compiling the pattern on each iteration. Take it out of the loop. — chapelo, Jul 16 '16 at 13:08
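
The suggestion in the comments above, compiling the pattern once before the loop, could look like the following minimal sketch. The function name `matching_lines` and the sample data are hypothetical. Python's `re` module caches recently compiled patterns, so the gain may be smaller than expected, but the per-line lookup is still avoided.

```python
import re

# Compile once, before the loop, rather than on every iteration.
# pattern.match() only matches at the start of the string,
# so the leading "^" from the question is not needed here.
PATTERN = re.compile(r"1234567")


def matching_lines(lines, pattern=PATTERN):
    """Yield each line that starts with the pattern."""
    for line in lines:
        if pattern.match(line):
            yield line


# Illustrative in-memory data; an open file object can be passed instead.
sample = ["1234567 first\n", "x1234567 second\n", "12345678 third\n"]
first_hits = list(matching_lines(sample))
```

With the file from the question, the call would be `for line in matching_lines(f): print(line, end="")` inside the `with open('largetextfile.txt') as f:` block.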

score 4 · Accepted Answer · answered Jul 16 '16 at 12:10

Check if this matches your requirement:

with open('largetextfile.txt') as f:
    for line in f:
        if line.startswith('1234567'):
            print(line)

shiva

@scripting.filesystemobject can you check time on this answer. I'm guessing it is much faster than yours – joel goldstick Jul 16 '16 at 12:13
It takes almost half the time of my code. I will validate it. Thanks – Rahul Jul 16 '16 at 12:14

score 1 · Answer 2 · edited May 23 '17 at 12:17

Since you're matching a fixed string, you don't need regular expressions, so you can use this:

with open('bigfile.txt') as f:
    for line in f:
        if line[:7] == "1234567":
            print(line)

I noticed that string slicing is slightly faster than `startswith`, and found that this has already been discussed here.
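
To check the slicing-versus-`startswith` claim on your own machine, a rough timing sketch with `timeit` might look like this. The sample lines, function names and repeat count are assumptions made for illustration; results will vary with the Python version and the data.

```python
import timeit

PREFIX = "1234567"
# Hypothetical sample data: alternating matching and non-matching lines.
LINES = ["1234567 some text\n", "some other text 1234567\n"] * 1000


def count_with_slice(lines, prefix=PREFIX):
    """Count lines whose first len(prefix) characters equal prefix."""
    n = len(prefix)
    return sum(1 for line in lines if line[:n] == prefix)


def count_with_startswith(lines, prefix=PREFIX):
    """Count lines for which str.startswith(prefix) is true."""
    return sum(1 for line in lines if line.startswith(prefix))


def compare(repeat=20):
    """Return (slice_seconds, startswith_seconds) for `repeat` passes over LINES."""
    t_slice = timeit.timeit(lambda: count_with_slice(LINES), number=repeat)
    t_starts = timeit.timeit(lambda: count_with_startswith(LINES), number=repeat)
    return t_slice, t_starts
```

Calling `print(compare())` prints the two timings in seconds. Both functions return the same count, so any difference is purely in how the prefix is tested.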

answered Jul 16 '16 at 13:15

user2314737

Thanks. It is faster. – Rahul Jul 18 '16 at 03:54

score 1 · Answer 3 · answered Jul 16 '16 at 14:49

To run some tests, I copied the following text (6.31 MB, around 128,000 lines) into a file AAA.txt:
http://norvig.com/big.txt
Then, with the help of the `random` module, I created a file BBB.txt from it by inserting '1234567' at the start of 1000 randomly chosen lines.
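
The answer does not show how the test file was built. One possible way to do it, as a sketch only, is shown here; the function name `insert_prefix` and its parameters are hypothetical.

```python
import random


def insert_prefix(lines, prefix="1234567", count=1000, seed=None):
    """Return a copy of `lines` with `prefix` prepended to `count` random lines."""
    rng = random.Random(seed)  # pass a seed for a reproducible choice
    chosen = set(rng.sample(range(len(lines)), count))
    return [prefix + line if i in chosen else line
            for i, line in enumerate(lines)]
```

To produce BBB.txt, read AAA.txt with `f.readlines()`, pass the list to `insert_prefix`, and write the result to BBB.txt with `writelines()`.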

I tested several solutions on this modified text.

I can't tell which of the following is the fastest, but I think they are all faster than the other solutions on this page and the other solutions I tried.

They rely on the fact that the `in` test, 'string' in 'anotherstring', is tremendously fast, so it cheaply rejects most lines before the stricter check runs.

# Variant 1: `in` test and startswith() combined in the filter function
def in_and_startswith(x):
    return '1234567' in x and x.startswith('1234567')

with open('BBB.txt') as f:
    for line in filter(in_and_startswith, f):
        x = 0

# Variant 2: `in` test and find() combined in the filter function
def in_and_find(x):
    return '1234567' in x and x.find('1234567') == 0

with open('BBB.txt') as f:
    for line in filter(in_and_find, f):
        x = 0

# Variants 3 and 4: filter with `in` only, then check the position in the loop
def just_in(x):
    return '1234567' in x

with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.startswith('1234567'):
            x = 0

with open('BBB.txt') as f:
    for line in filter(just_in, f):
        if line.find('1234567') == 0:
            x = 0

Note that I ran the tests with the dummy statement x=0 instead of print(line), because print() takes a long time to execute. For the same reason, calling print() many times takes much longer than printing a single string built by joining all the strings to be printed.

Test the execution times of

for x in ['hkjh','kjhoi','3135487j','kjhskdkfh','54545779']:
    print(x)

and

print('\n'.join(x for x in ['hkjh','kjhoi','313587j','kjhskdkfh','54545779']))

and you'll see the difference.
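
Applied to the original problem, the same idea is to collect the matching lines and write them out in a single call. A minimal sketch, assuming the matches fit in memory; the function name `join_matches` is hypothetical.

```python
def join_matches(lines, prefix="1234567"):
    """Return all lines that start with `prefix`, concatenated into one string.

    Lines read from a file already end with a newline,
    so they are joined with the empty string.
    """
    return "".join(line for line in lines if line.startswith(prefix))


# Illustrative in-memory data; an open file object can be passed instead.
joined = join_matches(["1234567 a\n", "b 1234567\n", "1234567 c\n"])
```

With a file, `sys.stdout.write(join_matches(f))` then emits all matches at once. If the number of matches is very large, writing in batches may be a better trade-off than holding everything in one string.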

It will take some time for me to understand. – Rahul Jul 18 '16 at 03:52
