
I have a text file that contains some data.

#this is a sample file
# data can be used for practice
total number = 5

t=1
dx= 10 10
dy= 10 10
dz= 10 10

1 0.1 0.2 0.3
2 0.3 0.4 0.1
3 0.5 0.6 0.9
4 0.9 0.7 0.6
5 0.4 0.2 0.1

t=2
dx= 10 10
dy= 10 10
dz= 10 10

1 0.11 0.25 0.32
2 0.31 0.44 0.12
3 0.51 0.63 0.92
4 0.92 0.72 0.63
5 0.43 0.21 0.14

t=3
dx= 10 10
dy= 10 10
dz= 10 10

1 0.21 0.15 0.32
2 0.41 0.34 0.12
3 0.21 0.43 0.92
4 0.12 0.62 0.63
5 0.33 0.51 0.14

My aim is to read the file, find the rows whose first column is 1 or 5, and store them as multidimensional arrays. For example, for 1 it will be a1=[[0.1, 0.2, 0.3],[0.11, 0.25, 0.32],[0.21, 0.15, 0.32]] and for 5 it will be a5=[[0.4, 0.2, 0.1],[0.43, 0.21, 0.14],[0.33, 0.51, 0.14]].

Here is the code that I have written:

import numpy as np

with open("position.txt", "r") as data:
    lines = data.read().split(sep='\n')
    a1 = []
    a5 = []
    for line in lines:
        if line.startswith('1'):
            a1.append(list(map(float, line.split()[1:])))
        elif line.startswith('5'):
            a5.append(list(map(float, line.split()[1:])))

a1 = np.array(a1)
a5 = np.array(a5)

My code works perfectly with the sample file above, but in the real case my file is much larger (2 GB), and handling it with my code raises a memory error. How can I solve this issue? I have 96 GB of RAM in my workstation.

Tanvir
  • Does the second comment on the top answer to [this question](https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list) help you? – David Wierichs Jun 11 '20 at 22:53
  • @DavidWierichs yes, that helps. Actually I already did that, but the whole thing is still quite slow. The answer below is quite efficient. – Tanvir Jun 12 '20 at 01:37

1 Answer


There are several things to improve:

  • Don't attempt to load the entire text file into memory (that will save 2 GB).
  • Use numpy arrays, not lists, for storing numerical data.
  • Use single-precision floats rather than double-precision.

So, you need to estimate in advance how big your arrays will be. It looks like there may be 16 million records for 2 GB of input data. With 32-bit floats, you need 16e6*2*4 = 128 MB of memory. For a 500 GB input, it would fit in 33 GB of memory (assuming the same ~120-byte record size).
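
Instead of hard-coding nmax in the code below, one could derive it from the input file size. Here is a small sketch under the same ~120-byte-per-record assumption (the 1.25 safety factor is an arbitrary choice of mine):

import os

# Rough pre-sizing sketch: assumes ~120 bytes per record, as estimated above.
file_bytes = os.path.getsize("position.txt")
approx_records = file_bytes // 120       # crude record-count estimate
nmax = int(1.25 * approx_records)        # leave some safety margin for preallocation
print(approx_records, nmax)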

import numpy as np
nmax = int(20e+6) # take a bit of safety margin

a1 = np.zeros((nmax, 3), dtype=np.float32)
a5 = np.zeros((nmax, 3), dtype=np.float32)
n1 = n5 = 0

with open("position.txt","r") as data:
    for line in data:
        if '0' <= line[0] <= '9':  # only data rows start with a digit (the particle index)
            values = np.fromstring(line, dtype=np.float32, sep=' ')
            if values[0] == 1:
                a1[n1] = values[1:] 
                n1 += 1
            elif values[0] == 5:
                a5[n5] = values[1:]
                n5 += 1

# trim (no memory is released)
a1 = a1[:n1]
a5 = a5[:n5]

Note that float equalities (==) are generally not recommended, but in the case of values[0] == 1, we know that it's a small integer, for which float representations are exact.
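
For illustration (my own example, not from the post above): small integers such as 1 or 5 round-trip exactly through float32, and exactness is only lost for integers above 2**24.

import numpy as np

# Small integers are exact in float32, so == comparisons against them are safe.
print(np.float32(5.0) == 5)                 # True
print(np.float32(2**24) == 2**24)           # True: 16777216 is still exact
print(np.float32(2**24 + 1) == 2**24 + 1)   # False: beyond float32's exact-integer range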

If you want to economize on memory (for example if you want to run several python processes in parallel), then you could initialize the arrays as disk-mapped arrays, like this:

a1 = np.memmap('data_1.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
a5 = np.memmap('data_5.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
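
As a rough sketch of the parallel idea (an assumption about your workflow, not part of the code above): if the data is split across several input files, each file can be parsed by its own worker process, with each worker writing its own pair of memmap files. The input file names (run_000.txt, ...) are hypothetical.

import numpy as np
from multiprocessing import Pool

nmax = int(20e+6)  # per-file preallocation, same safety margin as above

def parse_one(fname):
    # Each worker writes its own pair of disk-backed arrays.
    a1 = np.memmap(fname + '_1.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
    a5 = np.memmap(fname + '_5.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
    n1 = n5 = 0
    with open(fname) as data:
        for line in data:
            if '0' <= line[0] <= '9':
                values = np.fromstring(line, dtype=np.float32, sep=' ')
                if values[0] == 1:
                    a1[n1] = values[1:]
                    n1 += 1
                elif values[0] == 5:
                    a5[n5] = values[1:]
                    n5 += 1
    a1.flush()
    a5.flush()
    return fname, n1, n5  # keep the counts; they are needed to trim the memmaps later

if __name__ == '__main__':
    files = ['run_000.txt', 'run_001.txt']  # hypothetical input files
    with Pool(processes=2) as pool:
        for fname, n1, n5 in pool.map(parse_one, files):
            print(fname, n1, n5)

Each worker stays within its own per-file memory budget; the resulting memmap files can then be converted to npz as described below.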

With memmap, the files won't contain any metadata on the data type or array shape (nor human-readable descriptions). I'd recommend that you convert the data to npz format in a separate job; don't run these jobs in parallel, because they will load the entire array into memory.

n = 3  # number of rows actually written (n1 or n5 from the parsing step; 3 for the sample data)
a1m = np.memmap('data_1.bin', dtype=np.float32, shape=(n, 3))
a5m = np.memmap('data_5.bin', dtype=np.float32, shape=(n, 3))
np.savez('data.npz', a1=a1m, a5=a5m, info='This is test data from SO')

You can load them like this:

data = np.load('data.npz')
a1 = data['a1']

Depending on the balance between cost of disk space, processing time, and memory, you could compress the data.

import zlib
zlib.Z_DEFAULT_COMPRESSION = 3 # faster for lower values
np.savez_compressed('data.npz', a1=a1m, a5=a5m, info='...')

If float32 has more precision than you need, you could truncate the binary representation for better compression.
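
A minimal sketch of that idea (my own illustration; truncate_mantissa, the output file name, and the choice of 10 bits are all arbitrary): view the float32 data as uint32 and zero the lowest mantissa bits, so that the resulting runs of zero bits compress much better.

import numpy as np

def truncate_mantissa(a, bits_to_drop=10):
    # Zero the lowest `bits_to_drop` of float32's 23 mantissa bits; the values
    # change only slightly, but the zeroed bits compress much better.
    b = np.ascontiguousarray(a, dtype=np.float32).copy()
    mask = ~np.uint32((1 << bits_to_drop) - 1)
    b.view(np.uint32)[...] &= mask
    return b

np.savez_compressed('data_truncated.npz',
                    a1=truncate_mantissa(a1m), a5=truncate_mantissa(a5m))

Like the npz conversion above, this loads full copies of the arrays into memory, so it belongs in the same kind of separate job.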

If you like memory-mapped files, you can save in npy format:

np.save('data_1.npy', a1m)
a1 = np.load('data_1.npy', mmap_mode='r+')

But then you can't use compression, and you'll end up with many files whose only metadata is the array shape and data type (no human-readable description).

Han-Kwang Nienhuys
  • Thanks for your nice suggestion. One more problem: if I filter on '1', it takes any row that starts with 1 (e.g. 1, 11, 121, etc.). Is it possible to match only '1' or '11'? – Tanvir Jun 12 '20 at 01:34
  • Regarding the data: 2 GB is only one file, but I have some files that contain 500 GB of data or more (molecular dynamics simulations). I'm just looking for efficiency. Does multiprocessing help in any case? I have an 80-core processor. – Tanvir Jun 12 '20 at 01:42
  • I updated the answer to cover memory-mapped files, multiprocessing, and numbers above 9. – Han-Kwang Nienhuys Jun 12 '20 at 10:26
  • Thank you for your detailed answer. It works perfectly. – Tanvir Jun 12 '20 at 17:34
  • If you like the answer, you are allowed to click the upvote button. ;) – Han-Kwang Nienhuys Jun 12 '20 at 19:06