5

I have a large data file of shape (N, 4) which I am mapping line by line. My files are about 10 GB each; a simplistic implementation is given below. Though the following works, it takes a huge amount of time.

I would like to implement this logic so that the text file is read directly and I can access the elements. Afterwards, I need to sort the whole (mapped) file based on the values in column 2.

The examples I see online assume a smaller piece of data (d) and use f[:] = d[:], but I can't do that since d is huge in my case and eats up my RAM.

PS: I know how to load the file with np.loadtxt and sort it with argsort, but that approach fails with a memory error at GB file sizes. I would appreciate any direction.

nrows, ncols = 20000000, 4  # nrows is really larger than this no. this is just for illustration
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

filename = "my_file.txt"

with open(filename) as file:

    for i, line in enumerate(file):
        floats = [float(x) for x in line.split(',')]
        f[i, :] = floats
del f
nuki
  • 101
  • 5
  • If you can split the files, you might be able to use dask. – James Jul 22 '20 at 20:23
  • @user13815479 the `nrows, ncols` in your example represent 320 MB of data, which should easily fit in memory. How big is it really? – Han-Kwang Nienhuys Jul 22 '20 at 20:34
  • @James: Thanks, I am a newbie. Can you please elaborate more? If you can share a MWE, I would really appreciate that. – nuki Jul 22 '20 at 20:34
  • @Han-KwangNienhuys: That's just a simple example to show my logic. My text file is really large (10 GB) and N is large accordingly. – nuki Jul 22 '20 at 20:49
  • `loadtxt` (and `genfromtxt`) read csv files line by line, accumulating values in a list of lists (or arrays) which is converted to an array at the end. `pandas` has a `c` based mode for its `pd.read_csv` which is faster - but the result is a dataframe. – hpaulj Jul 22 '20 at 21:19
  • https://stackoverflow.com/questions/21653738/parsing-large-9gb-file-using-python – Lewis Farnworth Jul 22 '20 at 21:48
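The dask suggestion from the comments above was never expanded into an example; here is a minimal sketch of what that route could look like (untested on 10 GB input; the input filename comes from the question, and the column names and the output pattern 'sorted-*.csv' are placeholders). dask.dataframe reads the csv in blocks rather than all at once, and sorting by a column is expressed as set_index, which reshuffles the partitions:

import dask.dataframe as dd

# read the csv in partitions; header=None/names because the file has no header row
df = dd.read_csv('my_file.txt', header=None, names=['a', 'b', 'c', 'd'])

# sorting by a column is done with set_index, which shuffles the partitions
df = df.set_index('b')

# write the sorted result back out, one csv part per partition
df.to_csv('sorted-*.csv')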

2 Answers

3

EDIT: Instead of do-it-yourself chunking, it's better to use pandas' chunking feature, which is much, much faster than NumPy's loadtxt.

import numpy as np
import pandas as pd

## create csv file for testing
np.random.seed(1)
nrows, ncols = 100000, 4
data = np.random.uniform(size=(nrows, ncols))
np.savetxt('bigdata.csv', data, delimiter=',')

## read it back
chunk_rows = 12345
# Replace np.empty by np.memmap array for large datasets.
odata = np.empty((nrows, ncols), dtype=np.float32)
oindex = 0
chunks = pd.read_csv('bigdata.csv', chunksize=chunk_rows, 
                     names=['a', 'b', 'c', 'd'])
for chunk in chunks:
    m, _ = chunk.shape
    odata[oindex:oindex+m, :] = chunk
    oindex += m

# check that it worked correctly.
assert np.allclose(data, odata, atol=1e-7)

With chunksize set, pd.read_csv returns an iterator that can be used in a loop such as for chunk in chunks:; at every iteration it reads a chunk of the file and returns its contents as a pandas DataFrame, which can be treated as a numpy array in this case. The names parameter is needed to prevent it from treating the first line of the csv file as column names.
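The np.empty buffer is the only part of the code above that still scales with the file size; as the comment in the code says, it can be swapped for an np.memmap for data that does not fit in RAM. The question also asks for a sort on column 2 afterwards. The following is a rough sketch of both steps, not part of the original answer: it assumes nrows is known in advance (e.g. from a preliminary line count), and the argsort only keeps the key column and the index array in memory.

import numpy as np
import pandas as pd

nrows, ncols = 100000, 4        # assumed known in advance (e.g. from a line count)
chunk_rows = 12345

# disk-backed output array instead of np.empty
odata = np.memmap('memmapped.dat', dtype=np.float32,
                  mode='w+', shape=(nrows, ncols))

oindex = 0
for chunk in pd.read_csv('bigdata.csv', chunksize=chunk_rows,
                         names=['a', 'b', 'c', 'd']):
    m, _ = chunk.shape
    odata[oindex:oindex+m, :] = chunk
    oindex += m

# sort on column 2: only the key column and the permutation indices are held in RAM
order = np.argsort(odata[:, 1])

# write the rows out in sorted order, chunk by chunk, into a second memmap
sdata = np.memmap('sorted.dat', dtype=np.float32,
                  mode='w+', shape=(nrows, ncols))
for start in range(0, nrows, chunk_rows):
    sel = order[start:start+chunk_rows]
    sdata[start:start+chunk_rows, :] = odata[sel, :]
sdata.flush()

The gather step (odata[sel, :]) does random reads through the memmap, so it is slow on spinning disks, but its memory use stays bounded by chunk_rows.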

Old answer below

The numpy.loadtxt function works with a filename or with anything that yields lines when iterated over, as in a construct such as:

for line in f: 
   do_something()

It doesn't even need to pretend to be a file; a list of strings will do!
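For example, this works directly (a small illustration, not from the original answer):

import numpy as np

# a plain list of csv-formatted strings is a valid input to np.loadtxt
lines = ['1.0,2.0,3.0,4.0', '5.0,6.0,7.0,8.0']
arr = np.loadtxt(lines, delimiter=',')
print(arr.shape)   # (2, 4)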

We can read chunks of the file that are small enough to fit in memory and provide batches of lines to np.loadtxt.

def get_file_lines(fname, seek, maxlen):
    """Read lines from a section of a file.
    
    Parameters:
        
    - fname: filename
    - seek: start position in the file
    - maxlen: maximum length (bytes) to read
    
    Return:
        
    - lines: list of lines (only entire lines).
    - seek_end: seek position at end of this chunk.
    
    Reference: https://stackoverflow.com/a/63043614/6228891
    Copying: any of CC-BY-SA, CC-BY, GPL, BSD, LPGL
    Author: Han-Kwang Nienhuys
    """
    with open(fname, 'rb') as f:  # binary mode for Windows \r\n line endings
        f.seek(seek)
        buf = f.read(maxlen)
    n = len(buf)
    if n == 0:
        return [], seek
    
    # find a newline near the end, scanning backwards from the last byte
    # (start at 1: buf[-0] would index the first byte, not the last)
    for i in range(1, min(10000, n) + 1):
        if buf[-i] == 0x0a:
            # newline
            buflen = n - i + 1
            lines = buf[:buflen].decode('utf-8').split('\n')
            seek_end = seek + buflen
            return lines, seek_end
    else:
        raise ValueError('Could not find end of line')

import numpy as np

## create csv file for testing
np.random.seed(1)
nrows, ncols = 10000, 4

data = np.random.uniform(size=(nrows, ncols))
np.savetxt('bigdata.csv', data, delimiter=',')

# read it back        
fpos = 0
chunksize = 456 # Small value for testing; make this big (megabytes).

# we will store the data here. Replace by memmap array if necessary.
odata = np.empty((nrows, ncols), dtype=np.float32)
oindex = 0

while True:
    lines, fpos = get_file_lines('bigdata.csv', fpos, chunksize)
    if not lines:
        # end of file
        break
    rdata = np.loadtxt(lines, delimiter=',')
    m, _ = rdata.shape
    odata[oindex:oindex+m, :] = rdata
    oindex += m
    
assert np.allclose(data, odata, atol=1e-7)

Disclaimer: I tested this on Linux. I expect it to work on Windows, but the handling of '\r' characters could cause problems.

Han-Kwang Nienhuys
  • 3,084
  • 2
  • 12
  • 31
  • Awesome! This seems to be working on a test 1GB file I have. Would appreciate if you can answer my specific questions: [1] odata = np.empty((nrows, ncols), dtype=np.float32) <= Any limit on nrows/ncols. I am afraid if that'd give me an error for really large (10-40 GB) files? [2] What is the for loop doing? Being a newbie, having a hard time in understanding. [3] So all the data is saved in odata and we are still consuming RAM, correct? Because on sorting "odata = odata[np.argsort(odata[:, 1])]" I get a memory error for large file. Any suggestions to perform sorting? – nuki Jul 22 '20 at 23:01
  • Regarding [1] and [2]: I updated the answer. Regarding [3] on how to sort a large numpy memmap array: you need to post that as a new question if existing answers on that topic don't work for you. – Han-Kwang Nienhuys Jul 23 '20 at 06:03
  • Thanks, with your help I can read and store the data in HDF5, but I am unable to perform sorting on each chunk and produce a final (Nx4) array sorted on column-2. Any suggestions? I just asked the SO community as well. – nuki Jul 28 '20 at 20:46
-1

I realize this may not be a complete answer, but have you considered using binary files? When files are very large, storing them as ASCII text is very inefficient. If you can, use np.save and np.load instead.
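A minimal sketch of that idea follows (the .npy filename is a placeholder, and for the real 10 GB case the array would have to be built once in chunks, e.g. with the approach from the other answer). After the one-time conversion, np.load with mmap_mode='r' gives array-style access without reading the whole file into RAM:

import numpy as np

# small stand-in array; in practice this would be built once from the text file
data = np.random.uniform(size=(1000, 4)).astype(np.float32)
np.save('my_file.npy', data)            # compact binary .npy file

# later runs: memory-map the binary file instead of re-parsing gigabytes of text
arr = np.load('my_file.npy', mmap_mode='r')
print(arr[0, 1])                        # only the touched pages are read from disk
order = np.argsort(arr[:, 1])           # index for sorting by column 2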

scarpma
  • 29
  • 3
  • Can you please elaborate more, considering I am a newbie? I basically have a large text file with 4 columns. You are saying converting that to binary file and then using np.load? Does that uses RAM? Would appreciate if you can share a MWE please? – nuki Jul 22 '20 at 22:18
  • Your answer is incomplete and should have been rather posted as a comment. – mac13k Jul 23 '20 at 07:52