I am trying to find out which way is the most memory and computation efficient one for searching strings in large files.
The three methods I can think of are
Reading line by line and stopping when the string is found.
The standard way for small files:
string in open(filename).read()
Using Memory-mapping as suggested here. This answer also explains that that should resolve possible memory problems by not reading the file into memory by searching directly in the underlying file.
From this, what should be expected is that 1. should be fast and memory efficient if the search string is encountered early, otherwise slow, 2. should be really slow and memory consuming, and 3. should be very memory efficient and also fast if the search string is encountered early, slow otherwise.
I used this large (164MB) file from the USPTO and the following script to test this:
import time
import mmap
@profile
def test1(string, filename):
with open(filename, "r") as infile:
for line in infile:
if string in line:
return True
return False
@profile
def test2(string, filename):
return string in open(filename).read()
@profile
def test3(string, filename):
# from https://stackoverflow.com/a/4944929/1735215
with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
if s.find(bytes(string, "utf-8")) != -1:
return True
return False
def test_performance():
teststrings = ['PATDOC', 'chargeable program.</PDAT></PTEXT>']
testfile = 'pg020212.XML'
for teststring in teststrings:
t0 = time.time()
print(test1(teststring, testfile))
t1 = time.time()
print(t1 - t0)
t0 = time.time()
print(test2(teststring, testfile))
t1 = time.time()
print(t1 - t0)
t0 = time.time()
print(test3(teststring, testfile))
t1 = time.time()
print(t1 - t0)
test_performance()
The script tests, time profiles, and memory profiles all three methods first for a string that is found early in the file, then with another one that is only found at the end. What happens is:
$ python -m memory_profiler search_efficiency_test.py
True
0.0007505416870117188 # Method 1, string early in the file
True
0.47335124015808105 # Method 2, string early in the file
True
0.00037598609924316406 # Method 3, string early in the file
True
171.73131465911865 # Method 1, string at the bottom of the file
True
0.5401151180267334 # Method 2, string at the bottom of the file
True
0.15198254585266113 # Method 3, string at the bottom of the file
Filename: /home/user/search_efficiency_test.py
Line # Mem usage Increment Line Contents
================================================
4 32.652 MiB 65.289 MiB @profile
5 def test1(string, filename):
6 32.652 MiB 0.000 MiB with open(filename, "r") as infile:
7 32.652 MiB 0.000 MiB for line in infile:
8 32.652 MiB 0.000 MiB if string in line:
9 32.652 MiB 0.000 MiB return True
10 return False
Filename: /home/user/search_efficiency_test.py
Line # Mem usage Increment Line Contents
================================================
12 32.652 MiB 65.289 MiB @profile
13 def test2(string, filename):
14 32.695 MiB 0.059 MiB return string in open(filename).read()
Filename: /home/user/search_efficiency_test.py
Line # Mem usage Increment Line Contents
================================================
16 32.695 MiB 65.348 MiB @profile
17 def test3(string, filename):
18 # from https://stackoverflow.com/a/4944929/1735215
19 32.695 MiB 0.000 MiB with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
20 193.449 MiB 160.754 MiB if s.find(bytes(string, "utf-8")) != -1:
21 32.695 MiB -160.754 MiB return True
22 return False
So:
For strings found early in the file method 2 takes longer than 1 and 3, but with .5 seconds for a file of this size not outrageously long
For strings encountered only at the end (or not at all), method 1 takes really really long, 2 and 3 both are fast with 3 being faster than 2.
1 and 2 do not seem to use any substantial memory while methor 3, supposedly more memory efficient, uses huge quantities of memory.
What is going on? Am I misunderstanding the how these methods work? Why would method 1 take longer than method 2? Why do method 1 and 2 use less memory than 3? Or is the memory-profiler not to be trusted?
Software is: Linux kernel 4.15.15, python 3.6.4, memory-profiler 0.52.0 (installed via pip).
Edit:
Following @Barmar's suggestion I ran the tests again for the three methods separately (the other two commented out each time. However, the test results do not seem to change very much. To make sure that the bad performance of method 1 is not caused by the file somehow remaining in memory when the script terminates, I ran it in a different order (test 2 first, then 1, then 3):
$ python -m memory_profiler search_efficiency_test.py
True
1.2371280193328857
True
0.5623576641082764
Filename: /home/user/search_efficiency_test.py
Line # Mem usage Increment Line Contents
================================================
12 32.578 MiB 32.578 MiB @profile
13 def test2(string, filename):
14 32.578 MiB 0.000 MiB return string in open(filename, "r", encoding="utf8", errors='ignore').read()
$ python -m memory_profiler search_efficiency_test.py
True
0.000804901123046875
True
178.5931453704834
Filename: /home/user/search_efficiency_test.py
Line # Mem usage Increment Line Contents
================================================
4 32.582 MiB 32.582 MiB @profile
5 def test1(string, filename):
6 32.582 MiB 0.000 MiB with open(filename, "r") as infile:
7 32.582 MiB 0.000 MiB for line in infile:
8 32.582 MiB 0.000 MiB if string in line:
9 32.582 MiB 0.000 MiB return True
10 return False
$ python -m memory_profiler search_efficiency_test.py
True
0.0006160736083984375
True
0.16133618354797363
Filename: /home/user/search_efficiency_test.py
Line # Mem usage Increment Line Contents
================================================
16 32.688 MiB 32.688 MiB @profile
17 def test3(string, filename):
18 # from https://stackoverflow.com/a/4944929/1735215
19 32.688 MiB 0.000 MiB with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
20 193.281 MiB 160.594 MiB if s.find(bytes(string, "utf-8")) != -1:
21 32.688 MiB -160.594 MiB return True
22 return False