
I am downloading some data from the db, storing it in a numpy array, and cleaning up the array based on the contents of a particular column. This is the code I am using to delete some rows:

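import numpy as np

def clean_data(data, column):
    # Cast the chosen column to int and find the rows where it equals 1
    target_data = data[:, column].astype(int)
    pos_to_delete = np.where(target_data == 1)[0]
    # np.delete returns a new array without those rows
    data = np.delete(data, pos_to_delete, axis=0)
    return data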

I get the following error in numpy.

Traceback (most recent call last):
File "data_download.py", line 111, in download_data
data = clean_data(data)
File "/home/work/data_clean.py", line 13, in data_clean.py
data = np.delete(data,pos_to_delete,axis=0)
File "/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py", line 4262, in delete
new = arr[tuple(slobj)]
MemoryError

PS - If I get the data from the db and dump it to a file, then read this file and perform the clean up, this error doesn't show up anymore. The solutions to the question "Is there any way to delete the specific elements of an numpy array "In-place" in python" aren't helping. How do I delete with inplace=True and also take care of the memory issue? Can anyone please help? Thanks in advance.

Bidisha Das
  • The error is produced when `delete` creates the array that will return the result. It then intends to fill it with the 'keeper' values from the source. `delete` always returns a new array. Looks like other objects such as the source DataFrame are taking up a lot of memory, leaving little memory for further manipulation. – hpaulj Feb 22 '19 at 17:03
  • What's the size (shape) of `target_data` and `pos_to_delete`? – hpaulj Feb 22 '19 at 17:04
  • Data shape is (5L,40), so target data is (5L,1) – Bidisha Das Feb 23 '19 at 06:05
  • Doesn't seem large enough to cause memory errors. – hpaulj Feb 23 '19 at 06:09

1 Answer


np.delete takes several routes depending on the nature of the obj argument. In this case, where obj is generated by where and is therefore an array of indices to remove, it takes the following route:

def foo1(data, idx):
    msk = np.ones(data.shape[0],bool)
    msk[idx] = False
    return data[msk, :]

That is, it constructs a boolean mask that is all True and sets the selected items to False. arr[tuple(slobj)] is a slightly more general version that handles the axis parameter. But in your case axis is 0, so I can simplify it to [msk,:].
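As a quick check of the route described above, here is a minimal sketch on a small made-up array (the values and the choice of column 2 are only illustrative). It compares the mask-based function with np.delete and confirms that the result is a new array rather than a view of data:

```python
import numpy as np

def foo1(data, idx):
    # Mask-based route: start with all True, switch off the rows to remove
    msk = np.ones(data.shape[0], dtype=bool)
    msk[idx] = False
    return data[msk, :]

# Small illustrative array; column 2 plays the role of the target column
data = np.arange(20).reshape(5, 4)
data[:, 2] = [1, 0, 1, 7, 0]

idx = np.where(data[:, 2].astype(int) == 1)[0]

via_delete = np.delete(data, idx, axis=0)
via_mask = foo1(data, idx)

same = np.array_equal(via_delete, via_mask)      # both routes agree
is_copy = not np.shares_memory(via_mask, data)   # boolean indexing copies
```

Either way a second array holding the kept rows is allocated while data is still alive.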

So the msk is just a 1d boolean array with one element per row of your data.

np.delete(target_data,pos_to_delete,axis=0) would return the target_data column minus the deletes, a fairly small 1d array.

np.delete(data, ...) will return an array of comparable size to data, minus however many you delete.
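To make the size difference concrete, a rough sketch with a made-up float64 array of shape (1000, 40), where every fourth row is marked for deletion in column 3 (shape, dtype and column are assumptions, not your data):

```python
import numpy as np

# Hypothetical stand-in for the real data
data = np.zeros((1000, 40))
data[::4, 3] = 1  # 250 rows flagged for deletion

target_data = data[:, 3].astype(int)
pos_to_delete = np.where(target_data == 1)[0]

# Deleting from the single column gives a small 1d array
col_result = np.delete(target_data, pos_to_delete, axis=0)

# Deleting from the full array gives a new 2d array of comparable size
full_result = np.delete(data, pos_to_delete, axis=0)

print(col_result.shape, full_result.shape)
print(data.nbytes, full_result.nbytes)  # bytes held by the source and by the result
```

The source and the result exist at the same time, so the peak memory during the delete is roughly the sum of the two nbytes values.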

This makes me think that your data is so large that there's barely any room to do any calculations with it, even something as simple as making a copy.
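If a second full-size array really does not fit, one possible workaround is to compact the kept rows toward the front of the existing array and return a view of them. This is not what np.delete does; compact_rows_inplace is a hypothetical helper, and the sketch assumes data is a writable 2d array. The row-by-row loop is slow for large inputs, and the memory of the removed rows is not released, since the view still refers to the original buffer.

```python
import numpy as np

def compact_rows_inplace(data, keep):
    """Move the kept rows to the front of data and return a view of them.

    Hypothetical helper, not part of numpy. It overwrites data in place.
    """
    n_kept = 0
    # Kept indices come in ascending order, so the write position n_kept
    # never runs ahead of the read position i
    for i in np.flatnonzero(keep):
        if i != n_kept:
            data[n_kept] = data[i]
        n_kept += 1
    return data[:n_kept]

# Small illustrative array; column 2 plays the role of the target column
data = np.arange(20).reshape(5, 4)
data[:, 2] = [1, 0, 1, 7, 0]

# Reference result from np.delete, computed on a copy for comparison
expected = np.delete(data.copy(), np.where(data[:, 2] == 1)[0], axis=0)

keep = data[:, 2].astype(int) != 1
cleaned = compact_rows_inplace(data, keep)
```

Here cleaned holds the same rows that np.delete would return, but it shares memory with data, whose remaining rows past the kept ones are leftover and should be ignored.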

hpaulj