
I have a 2D array, and it has some duplicate columns. I would like to be able to see which unique columns there are, and where the duplicates are.

My own array is too large to put here, but here is an example:

a = np.array([[ 1.,  0.,  0.,  0.,  0.],
              [ 2.,  0.,  4.,  3.,  0.]])

This has the unique column vectors [1.,2.], [0.,0.], [0.,4.] and [0.,3.]. There is one duplicate: [0.,0.] appears twice.

Now I found a way to get the unique vectors and their indices here, but it is not clear to me how I would get the occurrences of the duplicates as well. I have tried several naive approaches (with np.where and list comprehensions), but they are all very slow. Surely there has to be a numpythonic way?

In MATLAB it's just the `unique` function, but `np.unique` flattens arrays.
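
(For reference: NumPy 1.13, released after this question was asked, added an axis keyword to np.unique, which together with return_inverse gives both the unique columns and where each one occurs. A minimal sketch, assuming NumPy >= 1.13:)

import numpy as np

a = np.array([[ 1.,  0.,  0.,  0.,  0.],
              [ 2.,  0.,  4.,  3.,  0.]])

# uniq holds each distinct column once; inv[j] says which unique column
# the j-th column of a is
uniq, inv = np.unique(a, axis=1, return_inverse=True)
groups = {tuple(uniq[:, k]): np.flatnonzero(inv == k)
          for k in range(uniq.shape[1])}
# groups == {(0.0, 0.0): array([1, 4]), (0.0, 3.0): array([3]),
#            (0.0, 4.0): array([2]), (1.0, 2.0): array([0])}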

  • @WarrenWeckesser I linked that because it solves the problem of finding unique rows, but it does not solve the problem of finding where in the array the duplicates are located – user2229219 Oct 06 '16 at 16:06
  • Do you intend to tag duplicate columns with duplicate IDs? Or do you intend to get the count of duplicate cols? – Divakar Oct 06 '16 at 16:12
  • I would like to be able to say something like `{col1: [0], col2: [1, 4], col3: [2], col4: [3]}`, i.e. have a list of the places where each unique column appears in the array. – user2229219 Oct 06 '16 at 16:31

3 Answers


Here's a vectorized approach that gives a list of arrays as output -

import numpy as np

# encode each column as one scalar ID (assumes non-negative integer values)
ids = np.ravel_multi_index(a.astype(int), a.max(1).astype(int) + 1)
sidx = ids.argsort()
sorted_ids = ids[sidx]
# split the sorted column indices wherever the ID increases
out = np.split(sidx, np.nonzero(sorted_ids[1:] > sorted_ids[:-1])[0] + 1)

Sample run -

In [62]: a
Out[62]: 
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 2.,  0.,  4.,  3.,  0.]])

In [63]: out
Out[63]: [array([1, 4]), array([3]), array([2]), array([0])]
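
Note that the astype(int) / ravel_multi_index step assumes the entries are non-negative integers (merely stored as floats). For arbitrary values, the same sort-and-split idea works dtype-agnostically; a sketch of that variant (mine, not part of the original answer):

sidx = np.lexsort(a[::-1])   # sort column indices lexicographically by rows
b = a[:, sidx]
# split wherever any entry changes between consecutive sorted columns
out = np.split(sidx, np.flatnonzero((b[:, 1:] != b[:, :-1]).any(0)) + 1)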

The numpy_indexed package (disclaimer: I am its author) contains efficient functionality for computing this kind of thing:

import numpy_indexed as npi
unique_columns = npi.unique(a, axis=1)
# boolean mask over the original columns: True where that column occurs more than once
non_unique_column_idx = npi.multiplicity(a, axis=1) > 1

Or alternatively:

unique_columns, column_count = npi.count(a, axis=1)
duplicate_columns = unique_columns[:, column_count > 1]
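
A quick usage sketch on the example array from the question (the ordering of the unique columns depends on numpy_indexed internals, so treat the output layout as indicative):

import numpy as np
import numpy_indexed as npi

a = np.array([[ 1.,  0.,  0.,  0.,  0.],
              [ 2.,  0.,  4.,  3.,  0.]])

unique_columns, column_count = npi.count(a, axis=1)
print(unique_columns[:, column_count > 1])  # the duplicated column [0., 0.]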

For small arrays, a plain Python dictionary keyed on column tuples also works:

    from collections import defaultdict

    # map each column (as a hashable tuple) to the indices where it appears
    indices = defaultdict(list)
    for index, column in enumerate(a.transpose()):
        indices[tuple(column)].append(index)
    unique = [kk for kk, vv in indices.items() if len(vv) == 1]
    non_unique = {kk: vv for kk, vv in indices.items() if len(vv) != 1}
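
On the example array this gives (on Python 3.7+, where dicts preserve insertion order):

    # unique     == [(1.0, 2.0), (0.0, 4.0), (0.0, 3.0)]
    # non_unique == {(0.0, 0.0): [1, 4]}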