
I have a dataset with 450,000 columns and 450 rows, all numeric values. I load the dataset into a NumPy array with the np.genfromtxt() function:

import numpy as np

# skip_header=1 skips the column names, which are the first row in the file
train = np.genfromtxt('train_data.csv', delimiter=',', skip_header=1)

train_labels = train[:, -1].astype(int)
train_features = train[:, :-1]

When the function is initially loading the dataset, it uses upwards of 15-20 GB of RAM. However, after the function finishes running, it goes down to only 2-3 GB of RAM usage. Why is np.genfromtxt() initially using up so much RAM?

Randy Olson
  • What's the size of your file and what kind of data types are stored in it? – Mazdak Mar 02 '18 at 19:59
  • `genfromtxt` reads the file line by line, splitting each line into a list of strings. It accumulates these in a list, and builds the array at the end. Keep in mind that it doesn't know the total size of the return array ahead of time. It might not even know the required `dtype`. In your case you didn't specify a dtype, so it parses everything as floats. – hpaulj Mar 02 '18 at 20:06
  • @Kasramvd: It's a 1.7 GB file on the hard drive. 450k columns and 450 rows, all float values. – Randy Olson Mar 02 '18 at 20:12
  • @hpaulj I want it to parse all of the values as floats. I tried explicitly specifying the dtype as `np.float` and that didn't seem to help with the initial excessive memory usage. – Randy Olson Mar 02 '18 at 20:14
  • I suppose I should extend my question to ask: Can I avoid this initial excessive memory usage with a parameter setting of `genfromtxt`, or from using a different function? I was initially using pandas to read the data file and that was even worse (slower and used even more memory). – Randy Olson Mar 02 '18 at 20:16
  • As stated in the documentation, since version 1.10 a `max_rows` argument has been added to `genfromtxt` to limit the number of rows read in a single call. Using this functionality, it is possible to read in multiple arrays stored in a single file by making repeated calls to the function (a sketch of this chunked approach follows these comments). – Mazdak Mar 02 '18 at 20:39
  • Also, if possible, use smaller float types like `float32` or `float16`. And take a look at this question as well: https://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy – Mazdak Mar 02 '18 at 20:43
  • The `iter_loadtxt` solution there is brilliant. That should be integrated into NumPy. Thanks a ton, @Kasramvd! Feel free to add that solution as an answer here and I'll mark it as the solution. – Randy Olson Mar 02 '18 at 21:00
  • With many columns and few rows your file may benefit from a different approach. Especially if you know the exact size ahead of time. – hpaulj Mar 02 '18 at 22:01
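
A minimal sketch of the chunked max_rows approach mentioned in the comments, assuming the layout described in the question (one header row followed by 450 data rows) and the smaller float32 dtype suggested above; the chunk size of 50 rows is arbitrary:

import numpy as np

# Read the file in fixed-size row chunks instead of all at once, so only one
# chunk's worth of intermediate Python objects is alive at any time.
n_rows, chunk_rows = 450, 50
chunks = []
with open('train_data.csv') as f:
    next(f)  # skip the header row
    for _ in range(n_rows // chunk_rows):
        chunks.append(np.genfromtxt(f, delimiter=',',
                                    dtype=np.float32, max_rows=chunk_rows))
train = np.vstack(chunks)

Note that np.vstack briefly needs the chunks and the final array in memory at the same time, so the peak usage is roughly twice the final array size rather than the much larger intermediate overhead seen with a single genfromtxt call.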

2 Answers


If you know the size of the array ahead of time, you could save time and space by loading each line into a target array as it is parsed.

For example:

In [173]: txt="""1,2,3,4,5,6,7,8,9,10
     ...: 2,3,4,5,6,7,8,9,10,11
     ...: 3,4,5,6,7,8,9,10,11,12
     ...: """

In [174]: np.genfromtxt(txt.splitlines(),dtype=int,delimiter=',',encoding=None)
Out[174]: 
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])

With a simpler parsing function:

In [177]: def foo(txt,size):
     ...:     out = np.empty(size, int)
     ...:     for i,line in enumerate(txt):
     ...:        out[i,:] = line.split(',')
     ...:     return out
     ...: 
In [178]: foo(txt.splitlines(),(3,10))
Out[178]: 
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])

out[i,:] = line.split(',') assigns a list of strings to a numeric-dtype array, which forces a conversion, just as np.array(line..., dtype=int) would.

In [179]: timeit np.genfromtxt(txt.splitlines(),dtype=int,delimiter=',',encoding=None)
266 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [180]: timeit foo(txt.splitlines(),(3,10))
19.2 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The simpler, direct parser is much faster.

However, if I try a simplified version of the approach that loadtxt and genfromtxt use internally (accumulate lists, then build the array at the end):

In [184]: def bar(txt):
     ...:     alist=[]
     ...:     for i,line in enumerate(txt):
     ...:        alist.append(line.split(','))
     ...:     return np.array(alist, dtype=int)
     ...: 
     ...: 
In [185]: bar(txt.splitlines())
Out[185]: 
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
In [186]: timeit bar(txt.splitlines())
13 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For this small case it's even faster; genfromtxt must have a lot of parsing overhead. (This is a small sample, so memory consumption isn't a factor here.)


For completeness, loadtxt:

In [187]: np.loadtxt(txt.splitlines(),dtype=int,delimiter=',')
Out[187]: 
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
In [188]: timeit np.loadtxt(txt.splitlines(),dtype=int,delimiter=',')
103 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With fromiter:

In [206]: def g(txt):
     ...:     for row in txt:
     ...:         for item in row.split(','):
     ...:             yield item
In [209]: np.fromiter(g(txt.splitlines()),dtype=int).reshape(3,10)
Out[209]: 
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
In [210]: timeit np.fromiter(g(txt.splitlines()),dtype=int).reshape(3,10)
12.3 µs ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
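
As a sketch of how this preallocation idea might look for a file shaped like the one in the question (one header row, then 450 rows of 450,000 float columns); the helper name load_known_size, the float32 dtype, and the filename are illustrative assumptions, not part of the original question:

import numpy as np

def load_known_size(filename, shape, dtype=np.float32, skiprows=1):
    # Preallocate the full array and fill it row by row while parsing, so the
    # per-line Python strings can be freed as soon as each row is converted.
    out = np.empty(shape, dtype=dtype)
    with open(filename) as f:
        for _ in range(skiprows):
            next(f)                               # skip the header row
        for i, line in enumerate(f):
            out[i, :] = line.rstrip().split(',')  # list of strings -> numeric row
    return out

train = load_known_size('train_data.csv', (450, 450000))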
hpaulj

@Kasramvd made a good suggestion in the comments to look into the solutions proposed in the question linked above. The iter_loadtxt() solution from one of the answers there turned out to be the perfect fix for my issue:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Record the number of columns from the last parsed line so the
        # flat array can be reshaped afterwards.
        iter_loadtxt.rowlength = len(line)

    # np.fromiter builds the array directly from the generator, so the
    # per-item Python objects never accumulate in memory.
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data
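
For example, the original genfromtxt() call can be replaced like this (a sketch; skiprows=1 takes the place of skip_header=1):

train = iter_loadtxt('train_data.csv', delimiter=',', skiprows=1)

train_labels = train[:, -1].astype(int)
train_features = train[:, :-1]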

The reason genfromtxt() takes up so much memory is that it does not store the data in an efficient NumPy array while it is parsing the file; it accumulates the values as Python objects (lists of strings) and only builds the array at the end, hence the excessive memory usage while NumPy was parsing my large data file.
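
Rough arithmetic makes the two numbers plausible: 450 rows × 450,000 columns ≈ 202.5 million values, which is about 1.6 GB as a float64 array (matching the 2-3 GB left after loading). While parsing, however, each value lives as a separate Python string inside nested lists, and on CPython a short string plus its list slot costs on the order of 60-100 bytes, i.e. roughly 12-20 GB for the whole file.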

Randy Olson
  • Yes, generally, flat-text files are not an ideal way to store data. Use a different serialization method. `numpy.save` and `numpy.load` implement the `.npy` binary serialization format. It is faster, more memory efficient, and much more portable (not to mention you don't lose info on floats). – juanpa.arrivillaga Mar 03 '18 at 02:38
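
A minimal sketch of that workflow (convert the CSV once with the iter_loadtxt() above, save it as .npy, and load the binary file on later runs; the .npy filename is illustrative):

import numpy as np

# One-time conversion from CSV to NumPy's binary .npy format.
train = iter_loadtxt('train_data.csv', delimiter=',', skiprows=1)
np.save('train_data.npy', train)

# On subsequent runs, load the binary file directly -- no text parsing at all.
train = np.load('train_data.npy')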