I have a large 2D NumPy array, for example:
arr = np.random.randint(0, 255, (243327132, 3), dtype=np.uint8)
I'm trying to get the unique rows of the array. Using np.unique, I get the following memory error:
unique_arr = np.unique(arr, axis=1)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Input In [96], line 1
----> 1 unique_arr = np.unique(arr, axis=1)

File <__array_function__ internals>:5, in unique

File C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\arraysetops.py:276, in unique
    276 dtype = [('f{i}'.format(i=i), ar.dtype) for i in range(ar.shape[1])]

File C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\arraysetops.py:276, in <listcomp>
    276 dtype = [('f{i}'.format(i=i), ar.dtype) for i in range(ar.shape[1])]

MemoryError:
---------------------------------------------------------------------------
I then tried np.memmap, hoping to reduce how much data has to be held in memory. Two versions: one where only the original array is memory-mapped, and a second where the target array is memory-mapped as well:
memarr = np.memmap("memarr.memmap", mode='w+', dtype="uint8", shape=arr.shape)
memarr[:] = arr  # fill the memory-mapped file; mode 'w+' starts out zeroed
unique_arr = np.unique(memarr, axis=1)
or
memarr = np.memmap("memarr.memmap", mode='w+', dtype="uint8", shape=arr.shape)
memarr[:] = arr
unique_arr = np.memmap("unique.memmap", mode='w+', dtype="uint8", shape=arr.shape)
unique_arr[:] = np.unique(memarr, axis=1)
Both of these produce exactly the same error as before. I can't simply divide the data into chunks, because as far as I'm aware np.unique would then only determine the unique values within each chunk. I did manage to get the unique values themselves by creating a frozenset from each row and adding it to a set (a sketch of this is below). However, I eventually need both the counts and the indices of the values, so neither this option nor np.bincount seems applicable.
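For reference, a minimal sketch of that set-based approach, on a small stand-in array. It uses tuple(row) rather than frozenset(row), since a frozenset would also conflate rows that differ only in the order or repetition of their values, e.g. (1, 2, 2) and (2, 1, 2):

import numpy as np

arr = np.random.randint(0, 255, (1000, 3), dtype=np.uint8)  # small stand-in array

# Collect each row as a hashable tuple; the set keeps only distinct rows.
seen = set()
for row in arr:
    seen.add(tuple(row))

unique_rows = np.array(sorted(seen), dtype=np.uint8)

This yields the unique rows themselves, but, as said, neither the counts nor the indices without extra bookkeeping.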
I have 16 GB of RAM and a 64-bit version of Python installed. Is there a way to avoid the memory error?