I have a numpy array with duplicate columns:

import numpy as np

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

I need to find the indexes of the duplicate columns, grouped something like this:

[0, 4]

[1, 2, 5]

I have a hard time dealing with indexes in Python. I really don't know how to approach it.

Thanks

I tried identifying the unique columns first with this function:

 def unique_columns(data):
     ind = np.lexsort(data)
     return data.T[ind[np.concatenate(([True], np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)))]].T

But I can't figure out the indexes from there.

user3329302
  • Do you need numpy performance, or is pure python implementation OK? – wim Feb 19 '14 at 18:40
  • You're close, you've found all the unique columns. Each True value is where a new group starts. `ind` has all the indices you want, but by indexing ind you're just taking one value instead of all of them. Try taking all the values in `ind` between consecutive Trues. – Bi Rico Feb 19 '14 at 18:51
  • Thanks guys. I think that would do. I am new to Python; being from the C++ old school I find it unnatural to deal with indexes in Python. I used numpy because my arrays are quite large [300000, 1000] – user3329302 Feb 19 '14 at 18:53

2 Answers


There is not a simple built-in way to do this, unfortunately. One option is to view each column as a single element and use np.unique. This method requires that the axis you want to treat as an element be contiguous in memory, and numpy's typical layout is C contiguous, i.e. contiguous along rows. Fortunately numpy makes the conversion simple:

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

def unique_columns2(data):
    # View each column as a single opaque (void) element so np.unique
    # can compare whole columns at once.
    dt = np.dtype((np.void, data.dtype.itemsize * data.shape[0]))
    # Fortran order makes each column contiguous, which the void view requires.
    dataf = np.asfortranarray(data).view(dt)
    u, uind = np.unique(dataf, return_inverse=True)
    # Convert the void elements back into columns of the original dtype.
    u = u.view(data.dtype).reshape(-1, data.shape[0]).T
    return (u, uind)

Our result:

u,uind = unique_columns2(A)

u
array([[0, 1, 1],
       [0, 1, 2],
       [0, 1, 3]]) 
uind
array([1, 2, 2, 0, 1, 2])

I am not really sure what you want to do from here; for example, you can group the column indices by unique column like this:

>>> [np.where(uind==x)[0] for x in range(u.shape[0])]
[array([3]), array([0, 4]), array([1, 2, 5])]
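If you only need the duplicate groups, you can equivalently build them in one pass over the inverse indices with plain Python (a sketch, using the `uind` shown above):

```python
import numpy as np
from collections import defaultdict

# Inverse indices as returned by unique_columns2(A) above.
uind = np.array([1, 2, 2, 0, 1, 2])

# Group original column indices by their unique-column label.
groups = defaultdict(list)
for col, label in enumerate(uind):
    groups[label].append(col)

print([groups[k] for k in sorted(groups)])
# [[3], [0, 4], [1, 2, 5]]
```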

Some timings:

tmp = np.random.randint(0,4,(30000,500))

#BiRico and OP's answer
%timeit unique_columns(tmp)
1 loops, best of 3: 2.91 s per loop

%timeit unique_columns2(tmp)
1 loops, best of 3: 208 ms per loop
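(If your NumPy is 1.13 or newer, `np.unique` accepts an `axis` argument that deduplicates columns directly, without the void-dtype trick; a sketch:)

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

# axis=1 treats each column as one element; uind labels each original column
# with the index of its unique column in u.
u, uind = np.unique(A, axis=1, return_inverse=True)
uind = np.asarray(uind).ravel()  # guard against shape differences across versions

# u.shape[1] is the number of distinct columns.
groups = [np.where(uind == x)[0] for x in range(u.shape[1])]
print(groups)
# [array([3]), array([0, 4]), array([1, 2, 5])]
```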
Daniel
  • There is a simple way to do this. – Bi Rico Feb 19 '14 at 18:42
  • @BiRico Then please show it rather than giving vague hints. Plus I do not see lexsort being faster than `unique` for a large number of rows. – Daniel Feb 19 '14 at 18:53
  • Feel free to implement it if you'd like, I don't have my computer right now to write it out and test it. – Bi Rico Feb 19 '14 at 18:54
  • 1
    +1 We had the lexsort vs void dtype discussion [here](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array/16973510#16973510). I like this approach much better, and I think I timed it back then and it also performed better, especially for very long columns. – Jaime Feb 19 '14 at 20:02
  • @Jaime Perhaps it is time to look at implementing this in numpy itself. Also, I had completely forgotten that I answered this exact same question a while ago. – Daniel Feb 19 '14 at 20:15
  • @Ophion I believe to make this work in general you're missing a +1 in here: [np.where(uind==x)[0] for x in range(u.shape[0]+1)] – PDRX May 03 '16 at 10:39

Here is an outline of how to approach it. Use numpy.lexsort to sort the columns, that way all the duplicates will be grouped together. Once the duplicates are all together, you can easily tell which columns are duplicates and the indices that correspond with those columns.

Here's an implementation of the method described above.

import numpy as np

def duplicate_columns(data, minoccur=2):
    # Sort column indices so identical columns end up adjacent.
    ind = np.lexsort(data)
    # Mark the positions where a new group of columns begins.
    diff = np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)
    edges = np.where(diff)[0] + 1
    # Split the sorted indices at those boundaries, one array per group.
    result = np.split(ind, edges)
    result = [group for group in result if len(group) >= minoccur]
    return result

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
print(duplicate_columns(A))
# [array([0, 4]), array([1, 2, 5])]
Bi Rico