
I have a ~16 GB file, bad_orders.csv, that I want to read into a NumPy array on a machine with 58 GB of RAM.

ubuntu@ip-172-31-22-232:~/Data/Autoencoder_Signin/joined_signin_rel$ free -g
          total        used        free      shared  buff/cache   available
Mem:             58           0          58           0           0          58
Swap:             0           0           0

When I run the following, the process gets killed repeatedly:

import numpy as np
arr = np.genfromtxt('bad_orders.csv', delimiter=',', missing_values='', dtype='float32')

While it runs, `free` shows that the process is using a disproportionate amount of memory:

ubuntu@ip-172-31-22-232:~$ free -g
          total        used        free      shared  buff/cache   available
Mem:             58          42          12           0           3          16
Swap:             0           0           0

Then I sampled 10,000 rows from the original file and checked the memory usage:

In [7]: samples = np.genfromtxt('samples.csv', delimiter=',', missing_values='', dtype='float32')

In [8]: samples.nbytes
Out[8]: 16680000

The sample array is about 0.017 GB. The full file has ~8M rows, so if memory usage scaled linearly the full array should take roughly 13 GB. Why does reading the whole file take more than 50 GB?
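For reference, this is the back-of-the-envelope calculation behind that 13 GB figure (the 417-column count is inferred from the sample's `nbytes`, so it is an estimate rather than something read from the file):

# Rough size estimate for the final float32 array.
sample_bytes = 16_680_000                 # samples.nbytes for 10,000 rows
n_cols = sample_bytes // (10_000 * 4)     # float32 is 4 bytes -> 417 columns
full_rows = 8_000_000
expected_bytes = full_rows * n_cols * 4   # bytes for the full array
print(expected_bytes / 1e9)               # ~13.3 GB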

user2517984
    Usage while loading the file will be more than the final array usage. How many columns? 2085? (8 bytes per item, 1000 rows). What's the typical line width? – hpaulj Jun 12 '18 at 22:36
  • Try an iterator/generator with `np.fromiter`. Might be slower, but should be much more memory efficient. And it shouldn't be too bad if you know the size beforehand. Something like `(map(float, row) for row in csv.reader(open('myfile.csv')))` (see the sketch after these comments). – juanpa.arrivillaga Jun 12 '18 at 22:49
  • With pure floats, no missing values, and a simple delimiter, `genfromtxt` is probably overkill. It's more useful when you need to use header field names, and automatically deduced field dtypes. – hpaulj Jun 12 '18 at 23:44
  • @hpaulj The total number of columns is 417. The reason I used `genfromtxt` is that `loadtxt` cannot handle missing values; in my case the missing values are empty strings. – user2517984 Jun 12 '18 at 23:58
  • Please let us know what happens, did you try pandas? – anishtain4 Jun 13 '18 at 14:11
  • @anishtain4 Yep. Pandas seems to have the same problem... memory usage exceeds the size of the data itself. – user2517984 Jun 13 '18 at 17:42
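To make the `np.fromiter` suggestion from the comments concrete, here is a minimal sketch, assuming 417 columns and that empty fields should become NaN; it streams one field at a time into a flat `float32` array and reshapes at the end, not a drop-in replacement tested on the actual file:

import csv
import numpy as np

N_COLS = 417  # assumed from the question; adjust to the real file

def fields(path):
    """Yield one float per CSV field, mapping empty strings to NaN."""
    with open(path, newline='') as f:
        for row in csv.reader(f):
            for field in row:
                yield float(field) if field else np.nan

# fromiter fills a flat float32 buffer straight from the generator,
# avoiding the intermediate Python lists that genfromtxt builds.
flat = np.fromiter(fields('bad_orders.csv'), dtype='float32')
arr = flat.reshape(-1, N_COLS)

If the row count is known up front, passing `count=n_rows * N_COLS` to `np.fromiter` lets it allocate the buffer once instead of growing it.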

1 Answer


`genfromtxt` does a lot of type checking and is only intended for small files. For larger files you're better off with `loadtxt`, though it still uses much more memory than the file itself, as mentioned here: http://akuederle.com/stop-using-numpy-loadtxt. An even better option is pandas' `read_csv`.
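For example, a minimal sketch of the pandas route, assuming the file has no header row and that every column should be `float32` (both assumptions about the asker's data, not facts from the question):

import numpy as np
import pandas as pd

# read_csv parses in C and can store the columns as float32 directly;
# empty fields come back as NaN without any extra options.
df = pd.read_csv('bad_orders.csv', header=None, dtype=np.float32)
arr = df.to_numpy()   # plain (rows, cols) float32 ndarray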

anishtain4
  • I thought both `genfromtxt` and `loadtxt` collect the data in a list of lists, one sublist per line of the `csv` file. `genfromtxt` may do more checking, but I don't think that will change the memory usage. But I haven't examined the code of both with that in mind. – hpaulj Jun 12 '18 at 22:33
  • @hpaulj Check the link I've provided. They have done speed tests as well as explaining why it's slower. Pandas is faster according to this link: http://akuederle.com/stop-using-numpy-loadtxt. I haven't done memory profiling on it, but since it's intended for big data, I think it should be much more memory efficient as well. – anishtain4 Jun 12 '18 at 22:40
  • The `pandas` one has two modes, the faster compiled, and slower Python version with more features. – hpaulj Jun 12 '18 at 23:39
  • The main difference seems to be that `loadtxt` loads the file in 50,000 line chunks, and converts the strings with a specified `dtype`. `genfromtxt` splits all lines, and does the `dtype` conversion once, possibly using a dtype that it has deduced. – hpaulj Jun 13 '18 at 01:25
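Building on the chunking idea in the last comment, here is a hedged sketch that pre-allocates the final `float32` array and fills it from pandas chunks, so peak memory stays near the ~13 GB of the array itself plus one chunk of text. The row and column counts are assumptions taken from the question:

import numpy as np
import pandas as pd

N_ROWS, N_COLS = 8_000_000, 417     # rough figures from the question
CHUNK = 50_000                      # mirrors loadtxt's internal chunk size

# Allocate the destination once, then copy each parsed chunk into place
# (assumes the N_ROWS estimate is not an undercount).
arr = np.empty((N_ROWS, N_COLS), dtype=np.float32)
start = 0
for chunk in pd.read_csv('bad_orders.csv', header=None,
                         dtype=np.float32, chunksize=CHUNK):
    stop = start + len(chunk)
    arr[start:stop] = chunk.to_numpy()
    start = stop
arr = arr[:start]                   # trim if the row estimate was high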