
I am trying to find out which approach is the most memory- and computation-efficient for searching for strings in large files.

The three methods I can think of are

  1. Reading line by line and stopping when the string is found.

  2. The standard way for small files: string in open(filename).read()

  3. Memory-mapping, as suggested here. That answer also explains that this should avoid potential memory problems, because it searches the underlying file directly instead of reading it into memory.

Based on this, my expectation is that 1. should be fast and memory-efficient if the search string occurs early in the file but slow otherwise; 2. should be quite slow and memory-hungry; and 3. should be very memory-efficient, and also fast if the search string occurs early but slow otherwise.
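(For comparison only, not one of the three methods tested below: a chunked read would bound memory while still allowing an early exit. This is just a sketch I wrote to illustrate the trade-off; the `chunk_size` default and the boundary-overlap handling are my own choices.)

```python
def search_chunked(string, filename, chunk_size=1 << 20):
    """Scan the file in fixed-size chunks, keeping only one chunk in memory.

    The last len(needle) - 1 bytes of each searched window are carried over,
    so matches straddling a chunk boundary are not missed.
    """
    needle = string.encode("utf-8")
    overlap = len(needle) - 1
    tail = b""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            tail = window[-overlap:] if overlap else b""
```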

I used this large (164MB) file from the USPTO and the following script to test this:

import time
import mmap

@profile
def test1(string, filename):
    with open(filename, "r") as infile:
        for line in infile:
            if string in line:
                return True
    return False

@profile
def test2(string, filename):
    return string in open(filename).read()

@profile
def test3(string, filename):
    # from https://stackoverflow.com/a/4944929/1735215
    with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
        if s.find(bytes(string, "utf-8")) != -1:
            return True
    return False

def test_performance():
    teststrings = ['PATDOC', 'chargeable program.</PDAT></PTEXT>'] 
    testfile = 'pg020212.XML'

    for teststring in teststrings:
        t0 = time.time()
        print(test1(teststring, testfile))
        t1 = time.time()
        print(t1 - t0)

        t0 = time.time()
        print(test2(teststring, testfile))
        t1 = time.time()
        print(t1 - t0)

        t0 = time.time()
        print(test3(teststring, testfile))
        t1 = time.time()
        print(t1 - t0)

test_performance()

The script times and memory-profiles all three methods, first with a string that is found early in the file, then with one that is only found at the end. This is what happens:

$ python -m memory_profiler search_efficiency_test.py
True
0.0007505416870117188                 # Method 1, string early in the file
True
0.47335124015808105                   # Method 2, string early in the file
True
0.00037598609924316406                # Method 3, string early in the file
True
171.73131465911865                    # Method 1, string at the bottom of the file
True
0.5401151180267334                    # Method 2, string at the bottom of the file
True
0.15198254585266113                   # Method 3, string at the bottom of the file
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   32.652 MiB   65.289 MiB   @profile
     5                             def test1(string, filename):
     6   32.652 MiB    0.000 MiB       with open(filename, "r") as infile:
     7   32.652 MiB    0.000 MiB           for line in infile:
     8   32.652 MiB    0.000 MiB               if string in line:
     9   32.652 MiB    0.000 MiB                   return True
    10                                 return False


Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    12   32.652 MiB   65.289 MiB   @profile
    13                             def test2(string, filename):
    14   32.695 MiB    0.059 MiB       return string in open(filename).read()


Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    16   32.695 MiB   65.348 MiB   @profile
    17                             def test3(string, filename):
    18                                 # from https://stackoverflow.com/a/4944929/1735215
    19   32.695 MiB    0.000 MiB       with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
    20  193.449 MiB  160.754 MiB           if s.find(bytes(string, "utf-8")) != -1:
    21   32.695 MiB -160.754 MiB               return True
    22                                 return False

So:

  • For strings found early in the file, method 2 takes longer than 1 and 3, but at about 0.5 seconds for a file of this size it is not outrageously slow.

  • For strings encountered only at the end (or not at all), method 1 takes extremely long, while 2 and 3 are both fast, with 3 faster than 2.

  • Methods 1 and 2 do not seem to use any substantial memory, while method 3, supposedly the most memory-efficient, uses huge amounts of memory.

What is going on? Am I misunderstanding how these methods work? Why does method 1 take longer than method 2? Why do methods 1 and 2 use less memory than 3? Or is the memory profiler not to be trusted?
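(To double-check the profiler's numbers independently of memory_profiler, I could also read the process's resident set size directly from `/proc/self/status` around the mmap search. This is a Linux-only sketch of my own; `rss_kib` and `mmap_find_with_rss` are hypothetical helper names, not part of any library.)

```python
import mmap

def rss_kib():
    """Resident set size in KiB, read from /proc/self/status (Linux-only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

def mmap_find_with_rss(string, filename):
    """Run the same mmap search as test3, reporting RSS before and after."""
    before = rss_kib()
    with open(filename, "rb", 0) as infile, \
            mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
        found = s.find(bytes(string, "utf-8")) != -1
    return found, before, rss_kib()
```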

Software is: Linux kernel 4.15.15, python 3.6.4, memory-profiler 0.52.0 (installed via pip).

Edit:

Following @Barmar's suggestion I ran the tests again for the three methods separately (with the other two commented out each time). However, the results do not seem to change much. To make sure that the poor performance of method 1 is not caused by the file somehow remaining in memory after the script terminates, I also ran the tests in a different order (test 2 first, then 1, then 3):

$ python -m memory_profiler search_efficiency_test.py
True
1.2371280193328857
True
0.5623576641082764
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    12   32.578 MiB   32.578 MiB   @profile
    13                             def test2(string, filename):
    14   32.578 MiB    0.000 MiB       return string in open(filename, "r", encoding="utf8", errors='ignore').read()


$ python -m memory_profiler search_efficiency_test.py
True
0.000804901123046875
True
178.5931453704834
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   32.582 MiB   32.582 MiB   @profile
     5                             def test1(string, filename):
     6   32.582 MiB    0.000 MiB       with open(filename, "r") as infile:
     7   32.582 MiB    0.000 MiB           for line in infile:
     8   32.582 MiB    0.000 MiB               if string in line:
     9   32.582 MiB    0.000 MiB                   return True
    10                                 return False


$ python -m memory_profiler search_efficiency_test.py
True
0.0006160736083984375
True
0.16133618354797363
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    16   32.688 MiB   32.688 MiB   @profile
    17                             def test3(string, filename):
    18                                 # from https://stackoverflow.com/a/4944929/1735215
    19   32.688 MiB    0.000 MiB       with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
    20  193.281 MiB  160.594 MiB           if s.find(bytes(string, "utf-8")) != -1:
    21   32.688 MiB -160.594 MiB               return True
    22                                 return False
  • You're using the same file in all three tests. Test 1 will read much of the file into memory, which speeds up tests 2 and 3. – Barmar Jun 21 '18 at 20:18
  • Benchmarking programs that access files is tricky, because you need to ensure that all test cases start with the same initial conditions (none of the file is in memory). – Barmar Jun 21 '18 at 20:19
  • @Barmar This is an interesting point, thanks. I tried to address this by running the script separately for all three methods (see the edit in the question). Somehow this does not seem to change anything. – 0range Jun 21 '18 at 20:37
  • It doesn't matter if you run them separately. The file stays in memory unless something pushes it out. – Barmar Jun 21 '18 at 20:38
  • Use different files for each test. And before running any of the tests, do something that uses lots of memory to force them out of memory. – Barmar Jun 21 '18 at 20:39
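(One way to get the cold-cache starting conditions @Barmar describes, without root access to `/proc/sys/vm/drop_caches`, might be `os.posix_fadvise`, which asks the kernel to drop a single file's cached pages. A minimal sketch, assuming Linux and Python ≥ 3.3; the advice is only a hint, so it is not guaranteed to evict everything.)

```python
import os

def evict_from_page_cache(filename):
    """Advise the kernel to drop this file's cached pages (Linux, advisory).

    Calling this before each timing run should let every method start
    with the file out of the page cache, without special privileges.
    """
    fd = os.open(filename, os.O_RDONLY)
    try:
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```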
