
I posted a similar question a few days ago, but without any code; now I've created test code in the hope of getting some help.

Code is at the bottom.

I have a dataset consisting of a bunch of large files (~100), and I want to extract specific lines from those files very efficiently (both in memory and in speed).

My code gets a list of relevant files. It opens each file with [line 1], then maps the file to memory with [line 2]. For each file it also receives a list of indices; going over the indices, it retrieves the relevant information (10 bytes in this example) as in [lines 3-4]. Finally, it closes the handles with [lines 5-6].

binaryFile = open(path, "r+b")
binaryFile_mm = mmap.mmap(binaryFile.fileno(), 0)
for INDEX in INDEXES:
    information = binaryFile_mm[INDEX:INDEX + 10].decode("utf-8")
binaryFile_mm.close()
binaryFile.close()

This code runs in parallel, with thousands of indices for each file, and it does this continuously, several times a second, for hours.

Now to the problem - the code runs well when I keep the indices small (meaning - when I ask the code to get information from the beginning of the file). But! when I increase the range of the indices, everything slows down to (almost) a halt AND the buff/cache memory fills up (I'm not sure whether the memory issue is related to the slowdown).

So my question is: why does it matter whether I retrieve information from the beginning or the end of the file, and how can I get fast access to information from the end of the file without the slowdown and the growing buff/cache memory use?

PS - some numbers and sizes: I have ~100 files, each about 1 GB in size. When I limit the indices to the first 0%-10% of the file it runs fine, but when I allow the index to be anywhere in the file it stops working.

Code - tested on Linux and Windows with Python 3.5; requires about 10 GB of storage (it creates 3 files of ~3 GB each, filled with random strings)

import os, errno, sys
import random, time
import mmap



def create_binary_test_file():
    print("Creating files with 3,000,000,000 characters, takes a few seconds...")
    test_binary_file1 = open("test_binary_file1.testbin", "wb")
    test_binary_file2 = open("test_binary_file2.testbin", "wb")
    test_binary_file3 = open("test_binary_file3.testbin", "wb")
    # translation table that maps every byte value to a lowercase ASCII letter
    tbl = bytes.maketrans(bytearray(range(256)),
                          bytearray([ord(b'a') + b % 26 for b in range(256)]))
    for i in range(1000):
        if i % 100 == 0:
            print("progress - ", i / 10, " % ")
        # efficiently create a random lowercase string and write it to all three files
        random_string = os.urandom(3000000).translate(tbl)
        test_binary_file1.write(random_string)
        test_binary_file2.write(random_string)
        test_binary_file3.write(random_string)
    test_binary_file1.close()
    test_binary_file2.close()
    test_binary_file3.close()
    print("Created binary files for testing. Each file contains 3,000,000,000 characters")




# Opening the binary test file (create the test files first if they don't exist yet)
try:
    binary_file = open("test_binary_file1.testbin", "r+b")
except OSError as e:
    if e.errno == errno.ENOENT:  # no such file or directory
        create_binary_test_file()
        binary_file = open("test_binary_file1.testbin", "r+b")
    else:
        raise




## example of use - perform 100 iterations; in each iteration: open one of the binary files and retrieve 50,000 sample strings
## (if the code runs fast and without a slowdown - increase k or the other numbers and it should reproduce the problem)

## Example 1 - getting information from start of file
print("Getting information from start of file")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1,100000-1000), k=50000)
    sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9 :
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()


## Example 2 - getting information from all of the file
print("Getting information from all of the file")
binary_file = open("test_binary_file1.testbin", "r+b")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1,3000000000-1000), k=50000)
    sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9 :
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()

My results: the average time of getting information from anywhere in the file is more than three times slower than getting it from the beginning, and with ~100 files and parallel access this difference becomes much larger.

Getting information from start of file
Iter  9         Average time -  0.14790
Iter  19        Average time -  0.14590
Iter  29        Average time -  0.14456
Iter  39        Average time -  0.14279
Iter  49        Average time -  0.14256
Iter  59        Average time -  0.14312
Iter  69        Average time -  0.14145
Iter  79        Average time -  0.13867
Iter  89        Average time -  0.14079
Iter  99        Average time -  0.13979
Getting information from all of the file
Iter  9         Average time -  0.46114
Iter  19        Average time -  0.47547
Iter  29        Average time -  0.47936
Iter  39        Average time -  0.47469
Iter  49        Average time -  0.47158
Iter  59        Average time -  0.47114
Iter  69        Average time -  0.47247
Iter  79        Average time -  0.47881
Iter  89        Average time -  0.47792
Iter  99        Average time -  0.47681
artembus
  • @Danny_ds Sorry for the confusion, I meant that overall I have hundreds of different files that will be accessed. I use a sensible number of threads between 4-16. – artembus Aug 10 '19 at 15:01

2 Answers


The basic reason why you have this time difference is that you have to seek to where you need in the file. The further from position 0 you are, the longer it's going to take.

What might help: since you know the starting index you need, seek the file descriptor to that point and then do the mmap. Or really, why bother with mmap in the first place - just read the number of bytes that you need from the seeked-to position and put that into your result variable.
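
To illustrate the seek-then-map idea, here is a minimal sketch (my own, not code from the question or the answer); the helper name read_at and its parameters are placeholders. Note that mmap's offset argument must be a multiple of mmap.ALLOCATIONGRANULARITY, so the requested index is rounded down to an allowed boundary:

import mmap

def read_at(path, index, size=10):
    # Map only a small, page-aligned window around `index` and return `size` bytes.
    granularity = mmap.ALLOCATIONGRANULARITY          # offset must be a multiple of this
    aligned = (index // granularity) * granularity    # round down to an allowed boundary
    delta = index - aligned
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), length=delta + size,
                       offset=aligned, access=mmap.ACCESS_READ)
        try:
            return mm[delta:delta + size].decode("utf-8")
        finally:
            mm.close()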

James McPherson
  • That was my impression of mmap - I thought it reads the information directly from memory without seeking (memory mapped). From my previous research it looks like I can get instant random access without loading the whole file with mmap in Python, so how do I approach your solution with mmap? (I couldn't find anything too similar in the brief googling I did just now, which is strange for a default Python package...) – artembus Jun 17 '19 at 11:47
  • Once you've got the file mapped into memory, sure, you've got instant access. Before then, however, the kernel still needs to go through the same operations in order to get the data that you want into the address space. Your system's mmap implementation is probably only mapping in the first page of the file by default. You could use a different pagesize when creating the mapping but that might wiggle your performance in undesirable ways :| – James McPherson Jun 17 '19 at 11:52
  • Now that you mentioned pages, I actually did try to test loading specific pages around the desired index and reading the information this way, I didn't see any change in performance but maybe I wrote something wrong in my code. – artembus Jun 17 '19 at 11:58
  • _"The further from position 0 you are, the longer it's going to take."_ - this led to [another question](https://stackoverflow.com/questions/57418396/efficiently-reading-few-lines-from-a-very-large-binary-file) where I disagreed with that. My take on the reason behind the difference in performance is in a separate answer I just posted here. – Nickolay Aug 09 '19 at 10:57

To determine whether you're getting adequate performance, check the memory available for the buffer/page cache (`free` in Linux), the I/O stats - the number of reads, their size and duration (`iostat`; compare with the specs of your hardware) - and the CPU utilization of your process.

[edit] Assuming that you read from a locally attached SSD (without having the data you need in the cache):

  • When reading in a single thread, you should expect your batch of 50,000 reads to take more than 7 seconds (50000*0.000150). Probably longer, because the 50k accesses of an mmap-ed file will trigger more or larger reads, as your accesses are not page-aligned - as I suggested in another Q&A, I'd use simple seek/read instead (and open the file with buffering=0 to avoid the unnecessary reads done by Python's buffered I/O); see the sketch after this block.
  • With more threads/processes reading simultaneously, you can saturate your SSD's throughput (how many 4KB reads per second it can do - anywhere from 5,000 to 1,000,000), and then the individual reads become even slower.

[/edit]
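
For reference, a minimal sketch of the plain seek/read approach mentioned above (my own sketch; the helper name sample_reads and its parameters are placeholders, not names from the question):

def sample_reads(path, indexes, size=1000):
    # Read `size` bytes at each byte offset in `indexes`, without mmap.
    samples = []
    with open(path, "rb", buffering=0) as f:   # buffering=0: unbuffered, so Python does no read-ahead
        for index in indexes:
            f.seek(index)                      # position directly at the wanted byte offset
            samples.append(f.read(size).decode("utf-8"))  # a raw read may in theory return fewer bytes
    return samples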

The first example only accesses 3*100KB of the files' data. Since you have much more than that available for the cache, all 300KB quickly ends up in the cache, so you'll see no I/O and your Python process will be CPU-bound.

I'm 99.99% sure that if you test reading from the last 100KB of each file, it will perform as well as the first example - it's not about the location of the data, but about the size of the data accessed.
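
As an illustration (my tweak, not part of the original answer), you could change the sampling range in Example 2 to draw indices only from the last ~100KB of the file, so the working set again fits in the page cache:

import random

file_size = 3000000000   # size of the test files created by the question's script
# drop-in replacement for the sample_index_list line in Example 2:
sample_index_list = random.sample(range(file_size - 100000, file_size - 1000), k=50000)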

The second example accesses random portions from 9GB, so you can hope to see similar performance only if you have enough free RAM to cache all of the 9GB, and only after you preload the files into the cache, so that the testcase runs with zero I/O.

In realistic scenarios, the files will not be fully in the cache - so you'll see many I/O requests and much lower CPU utilization for python. As I/O is much slower than cached access, you should expect this example to run slower.

Nickolay
  • I tried free and iostat during the run. iostat showed this line "Device: sdb tps - 303.13 kB_read/s - 2212.15 kB_wrtn/s - 0.00 kB_read - 817040433 kB_wrtn - 1508" with kB_read constantly rising. free showed "Mem: total - 32816024 used - 7480252 free - 251236 shared - 576680 buff/cache - 25084536 available - 24137328" while the free and buff/chache memory always fluctuating. – artembus Aug 10 '19 at 15:11
  • Additionally, since my script doesn't slow down progressively the further the accessed information is from the start, but instead drops off sharply once the accessed range gets large enough - your explanation makes more sense. Meaning that up to a certain point my memory is enough to cache the data, but beyond that point the memory is not sufficient and I/O operations take place, which slows down the process. But that's my entire point... I want to read only small parts of a large binary file, and I have the exact indices of those parts, but it seems I can't do that in Python without loading the whole file. – artembus Aug 10 '19 at 15:20
  • 130 tps is way too low if you're trying to test the "cold cache" performance, where stuff gets mostly read from the disk - you should [purge the page cache](https://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system). Your last comment sounds like you want to read the data from disk without waiting for it to be read from disk - I added an [edit] to my answer with information on how slow you should expect reads to be, if they end up accessing the disk. – Nickolay Aug 10 '19 at 17:24
  • Thanks for the edit, it sounds a bit clearer. I already tried the cache-clearing strategy and stumbled across the same thread you linked, but it doesn't seem to help in my case. Now that we understand the problem, do you think rewriting the data - so that instead of getting 5K bits of information scattered across the file, those bits would all sit together in one chunk - would make my code faster? Or would accessing and reading the file still be slow, with the same performance/speed? – artembus Aug 11 '19 at 07:41
  • Clearing the cache isn't supposed to "help" with the performance, only to make the measurements stable. In the presence of the disk cache, the performance varies a lot (milliseconds to seconds) depending on what's cached, confusing matters. Making one request to read 50MB will certainly be much faster than making 50,000 requests. The total throughput of lots of parallel processes might be the same on an SSD; I'm not sure how bad your current performance is in that scenario, or how fast your I/O subsystem is - you can expect something on the order of 1GB/s for typical SSDs. – Nickolay Aug 11 '19 at 07:56
  • I have a followup question: as I run the script (which loads the data and performs some analysis on it), I noticed that it starts out at a decent speed. But as it keeps running, after several hours there is a gradual slowdown to 2x, then 3x, then 4x the original run time per "cycle", and it keeps getting worse. Do you have an idea what's going on? If I completely stop the script and leave the PC for a few minutes, then run the script again, it starts out performing well but gradually slows down again. I tried a simple CTRL+Z for a few minutes but it didn't help. – artembus Aug 12 '19 at 09:58
  • No, I have not, but it sounds unrelated to the page cache, as restarting and pausing the process behave differently. – Nickolay Aug 12 '19 at 10:15