
I'm running SageMath 9.0 on Windows 10.

I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very distant) line, and whether I should read line by line, or whether reading by block could be "more optimal" in my case.

I have a 12 GB text file, made of around 1 billion short lines, all consisting of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:

J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...

For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.

I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can be very long (a few days at least on my laptop). I want to use the complete database in the file instead, so that I'm able to stop and start again from a certain point.

My current code reads the file line by line from the start, and does some computation after reading each line:

with open(filename,'r') as file:
    while True: 
        # Get next line from file 
        line = file.readline() 

        # if line is empty, end of file is reached 
        if not line: 
            print("End of Database Reached")
            break  
        
        G = Graph()
        from_graph6(G,line.strip())

        run_some_code(G)

In order to be able to stop the code, or save the progress in case of a crash, I was thinking of:

  • Every million lines read (or so), save the progress to a specific file
  • When restarting the code, read the last saved value and, instead of using line = file.readline(), use itertools: for line in islice(file, start_line, None).

so that my new code is:

from itertools import islice

start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename, 'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G, line.strip())

        run_some_code(G)
        count += 1

        # checkpoint: save the number of lines processed so far
        if count % save_every_n_lines == 0:
            save(count, 'foo')

The code works, but I would like to understand whether I can optimise it. I'm not a big fan of the if statement in my for loop.

  • Is itertools.islice() the right option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As "start" could be quite large, and given that I'm working on a simple text file, could there be a faster option, in order to "jump" directly to the start line?
  • Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
  • I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs (a sketch of what I mean is just below). Could that be a good option?
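
For illustration, a minimal sketch of what I mean by block reading, assuming a block of 100000 lines (an arbitrary size I have not benchmarked):

from itertools import islice

block_size = 100000  # arbitrary block size, not benchmarked

with open(filename, 'r') as file:
    while True:
        # read up to block_size lines in one go
        block = list(islice(file, block_size))
        if not block:
            break
        # build and process the whole batch of graphs
        for line in block:
            G = Graph()
            from_graph6(G, line.strip())
            run_some_code(G)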

Each line has a constant number of characters, so "jumping" might be feasible.
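
For concreteness, here is the kind of "jump" I have in mind, as a sketch only; it assumes every line is exactly 12 bytes (11 characters plus the newline), which I still need to verify:

start_line = load('foo')
line_len = 12  # 11 graph6 characters plus '\n' (assumed, to be verified)

with open(filename, 'rb') as file:
    # jump straight to the byte offset of the start line
    file.seek(start_line * line_len)
    for raw in file:
        G = Graph()
        from_graph6(G, raw.decode('ascii').strip())
        run_some_code(G)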

  • What do the actual lines look like? Can you post a few? Are the lines always the same size? – tdelaney Jul 17 '20 at 00:17
  • Oh wait, those are the exact lines? Is it text or binary? I suspect those ? are non-displayable characters. Perhaps an example reading as binary, say `open('thefile', 'rb').read(80)` – tdelaney Jul 17 '20 at 00:18
  • "jumping" to a line in a text file would require knowing its offset, which generally can't be done because each line can have a different length. If they are (or can be made to be) the same size, you could easily calculate the offset. Another approach would be to record the offset of each one before it's read, and store that value instead of the line number for resuming. – martineau Jul 17 '20 at 00:21
  • The examples I put are the actual first lines of the file. They should all have the same length (I'm checking right now on some random lines and in the format definition). Each line is made of ASCII printable characters. I'll edit the post with this info. – Thomas Lesgourgues Jul 17 '20 at 00:33
  • Link to Brendan McKay webpage (i.e. http://users.cecs.anu.edu.au/~bdm/data/graphs.html) gives a 502 Bad Gateway error. – DarrylG Jul 17 '20 at 01:07
  • @DarrylG yes, I noticed. It was working well yesterday. It should only be a temporary error. – Thomas Lesgourgues Jul 17 '20 at 01:14

2 Answers


Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a bytearray, and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index, and you can start up again with that index later.

This example is on Linux - opening the mmap on Windows is a bit different - but after it's set up, access should be the same.

import os
import mmap

# I think this is the record plus newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1 

# generate test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ] 
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data,6,9)])

del data
f.close()
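
To plug this into the question's workflow, resuming from a checkpoint might look like the following sketch; load, save, from_graph6 and run_some_code are taken from the question, and the checkpoint file name 'foo' is illustrative:

start = load('foo')  # last saved index, i.e. the next record to process
for i, record in enumerate(enum_records(data, start), start):
    G = Graph()
    from_graph6(G, record.decode('ascii'))
    run_some_code(G)
    if (i + 1) % 1000000 == 0:
        save(i + 1, 'foo')  # i + 1 records done; resume from there next time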
tdelaney

If the length of each line is constant (in this case 12: 11 characters plus the newline character), you might do

def get_line(k, line_len):
    with open('file') as f:
        # jump straight to the byte offset of line k and read that line
        f.seek(k * line_len)
        return next(f)
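
Building on the same idea, resuming iteration from line k (rather than fetching a single line) could be sketched like this; iter_from_line is an illustrative name, not part of the original answer:

def iter_from_line(k, line_len, filename):
    # seek to the start of line k, then yield every remaining line
    with open(filename) as f:
        f.seek(k * line_len)
        yield from f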
enedil