
I have a large 2D NumPy array, e.g. `arr = np.random.randint(0, 255, (243327132, 3), dtype=np.uint8)`.

I'm trying to get the unique rows of the array. Using `np.unique` I get the following MemoryError:

unique_arr = np.unique(arr,axis=1)

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
In  [96]:
Line 1:     unique_arr = np.unique(arr,axis=1)

File <__array_function__ internals>, in unique:
Line 5:     

File C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\arraysetops.py, in unique:
Line 276:   dtype = [('f{i}'.format(i=i), ar.dtype) for i in range(ar.shape[1])]

File C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\numpy\lib\arraysetops.py, in <listcomp>:
Line 276:   dtype = [('f{i}'.format(i=i), ar.dtype) for i in range(ar.shape[1])]

MemoryError: 
---------------------------------------------------------------------------

I also tried np.memmap to reduce the amount of data that has to be held in memory, in two versions: one where only the original array is memory-mapped, and one where the target array is memory-mapped as well:

memarr = np.memmap("memarr.memmap", mode='w+', dtype="uint8", shape=arr.shape)
unique_arr = np.unique(memarr,axis=1)

or

memarr = np.memmap("memarr.memmap", mode='w+', dtype="uint8", shape=arr.shape)
uniquearr = np.memmap("unique.memmap", mode='w+', dtype="uint8", shape=arr.shape)

uniquearr[:] = np.unique(memarr, axis=1)

Both of these result in the exact same error as before. I can't simply split the data into chunks, because as far as I'm aware np.unique would then only determine the unique values within each chunk. I did get the unique values themselves by creating a frozenset out of each row and adding it to a set (sketched below), but I eventually need both the count and the index of the values as well, so neither that option nor np.bincount seems applicable.
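Roughly, that set-based approach looked like this:

unique_rowsets = set()
for row in arr:
    # Collects the distinct values per row, but keeps no counts or indices.
    unique_rowsets.add(frozenset(row.tolist()))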

I have 16 GB of RAM and a 64-bit version of Python installed. Is there a way to avoid the memory error?

Naru1243
  • What is the approximate size of the result you expect? Nearly all the rows unique? 1/10 of rows unique, etc? In the latter case, you can merge chunks – Mad Physicist Aug 16 '22 at 20:06
  • I am pretty sure `np.unique(memarr,axis=1)` does not do what you think it does: it creates a new array with the duplicated **columns** of length 243327132 removed, so the result contains 1, 2 or 3 (sorted) columns. It does not iterate over the rows (which is the common use case). Is that what you actually need? If so, do you need the columns to be sorted? – Jérôme Richard Aug 16 '22 at 20:27

1 Answer


Is there a way to avoid the memory error?

Yes. The idea comes from this blog post, where the author explores eight different ways to find the unique rows of a NumPy array (the approach itself actually comes from Stack Overflow). Quoting the blog's 4th solution, with the code adapted to your problem:

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy’s structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn’t make a copy, and is quite efficient.

import numpy as np

arr = np.random.randint(0, 255, (243327132, 3), dtype=np.uint8)

def unique_rows(arr):
    # View each row as a single structured element (no copy is made), so that
    # np.unique compares whole rows instead of individual values.
    uniq = np.unique(arr.view(arr.dtype.descr * arr.shape[1]))
    # View the structured result back as the original dtype and restore the 2D shape.
    return uniq.view(arr.dtype).reshape(-1, arr.shape[1])

unique_arr = unique_rows(arr)
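Since you mention that you eventually need the indices and counts as well: np.unique accepts return_index and return_counts, and they work on the same structured view. A minimal sketch (the helper name unique_rows_with_info is just for illustration):

def unique_rows_with_info(arr):
    # Same structured view as above; np.unique can additionally report where
    # each unique row first appears and how often it occurs.
    view = arr.view(arr.dtype.descr * arr.shape[1])
    uniq, index, counts = np.unique(view, return_index=True, return_counts=True)
    return uniq.view(arr.dtype).reshape(-1, arr.shape[1]), index, counts

unique_arr, first_index, counts = unique_rows_with_info(arr)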

So this may solve your problem; at least it worked for me without blowing up my RAM.
Interesting problem, I learned something new :)

AlberNovo