I have a file bad_orders.csv of approximately 16 GB that I need to read into a NumPy array on a machine with 58 GB of RAM.
ubuntu@ip-172-31-22-232:~/Data/Autoencoder_Signin/joined_signin_rel$ free -g
              total        used        free      shared  buff/cache   available
Mem:             58           0          58           0           0          58
Swap:             0           0           0
When I run the following command, the job gets killed repeatedly:
import numpy as np
arr = np.genfromtxt('bad_orders.csv', delimiter=',', missing_values='', dtype='float32')
The terminal shows that it is using a disproportionate amount of memory before it is killed:
ubuntu@ip-172-31-22-232:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:             58          42          12           0           3          16
Swap:             0           0           0
I then sampled 10,000 rows from the original file into samples.csv and checked the memory usage:
In [7]: samples = np.genfromtxt('samples.csv', delimiter=',',
   ...:                         missing_values='', dtype='float32')
In [8]: samples.nbytes
Out[8]: 16680000
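To quantify how much working memory genfromtxt itself needs on top of the finished array, I am planning to compare the array's nbytes against the process's peak resident set size on the same sample. This is only a sketch using the standard resource module (on Linux, ru_maxrss is reported in KiB):

import resource
import numpy as np

def peak_rss_gib():
    # Peak resident set size of this process; ru_maxrss is in KiB on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2

samples = np.genfromtxt('samples.csv', delimiter=',',
                        missing_values='', dtype='float32')

print('final array size: %.3f GiB' % (samples.nbytes / 1024 ** 3))
print('peak RSS:         %.3f GiB' % peak_rss_gib())

If the peak RSS is already several times the array size on the sample, I would expect the same ratio on the full file.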
The sample array is only about 0.017 GB. The full file has ~8M rows, so if memory usage scales linearly, the complete array should take roughly 13 GB. Why does reading the whole file use more than 50 GB of memory?
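For reference, this is the back-of-the-envelope calculation I am doing; the column count of 417 is simply inferred from the sample's nbytes (16,680,000 bytes / 10,000 rows / 4 bytes per float32):

rows_sample = 10_000
bytes_sample = 16_680_000                # samples.nbytes from above
cols = bytes_sample // rows_sample // 4  # float32 is 4 bytes -> 417 columns

rows_full = 8_000_000                    # approximate row count of bad_orders.csv
expected_bytes = rows_full * cols * 4    # size of the full float32 array
print(expected_bytes / 1e9)              # ~13.3 GB, far below the usage I observe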