
I'm badly stuck, and no Pythonista I've asked has been able to help.

I'm using vstack to create an array of vectors in a loop like this:

Corr = np.vstack((Corr, S))

I need to remove the repeating vectors so that I'm left with an array of unique vectors, and then compare all of those vectors.

I know that this kind of comparison can be done with lists, but I haven't found a way to append full vectors to a list.

This is the result (I've marked unique vectors with unique letters):

Corr = [[ 0.  0.  0.  0. -2.  4.  4.  2.  2.] #a
 [-4. -4. -4. -4.  2.  4.  4.  2.  2.]#b
 [-4.  0.  0.  4. -2.  0.  0. -2.  2.]#c
 [ 0. -4. -4.  0.  2.  0.  0. -2.  2.]#d
 [ 0. -4.  4.  0. -2.  0.  0.  2. -2.]#e
 [-4.  0.  0. -4.  2.  0.  0.  2. -2.]#f
 [-4. -4.  4.  4. -2.  4. -4. -2. -2.]#g
 [ 0.  0.  0.  0.  2.  4. -4. -2. -2.]#h
 [ 0.  4. -4.  0. -2.  0.  0.  2. -2.]#i
 [-4.  0.  0. -4.  2.  0.  0.  2. -2.]#f
 [-4.  4. -4.  4. -2. -4.  4. -2. -2.]#j
 [ 0.  0.  0.  0.  2. -4.  4. -2. -2.]#k
 [ 0.  0.  0.  0. -2. -4. -4.  2.  2.]#l
 [-4.  4.  4. -4.  2. -4. -4.  2.  2.]#m
 [-4.  0.  0.  4. -2.  0.  0. -2.  2.]#n
 [ 0.  4.  4.  0.  2.  0.  0. -2.  2.]#o
 [ 4.  0.  0. -4. -2.  0.  0. -2.  2.]#c
 [ 0. -4. -4.  0.  2.  0.  0. -2.  2.]#d
 [ 0.  0.  0.  0. -2. -4. -4.  2.  2.]#p
 [ 4. -4. -4.  4.  2. -4. -4.  2.  2.]#q
 [ 4. -4.  4. -4. -2. -4.  4. -2. -2.]#r
 [ 0.  0.  0.  0.  2. -4.  4. -2. -2.]#k
 [ 0. -4.  4.  0. -2.  0.  0.  2. -2.]#e
 [ 4.  0.  0.  4.  2.  0.  0.  2. -2.]#s
 [ 4.  4. -4. -4. -2.  4. -4. -2. -2.]#t
 [ 0.  0.  0.  0.  2.  4. -4. -2. -2.]#h
 [ 0.  4. -4.  0. -2.  0.  0.  2. -2.]#i
 [ 4.  0.  0.  4.  2.  0.  0.  2. -2.]#s
 [ 4.  0.  0. -4. -2.  0.  0. -2.  2.]#u
 [ 0.  4.  4.  0.  2.  0.  0. -2.  2.]#o
 [ 0.  0.  0.  0. -2.  4.  4.  2.  2.]]#a

I don't know why vstack is printing a period after each number instead of a comma (when I print each vector S separately inside the loop, it shows commas!).

I need the end result to be an array of unique vectors (so in this case it would be vectors a–u, i.e. 21 vectors).

    As a side note, calling vstack in a loop to build up a matrix one-by-one is poor practice. Instead, create a regular old list of all your vectors and then combine them all at once. This saves tons of copying if the number of vectors is large. – John Zwinck Nov 14 '15 at 14:13
  • John, the "Find unique rows" link you mentioned also has a number of vstack solutions. And the vector elements are separated by commas when I print them separately, but with vstack there are periods showing up. I haven't been able to use any of those methods to sort through the array. – user3625380 Nov 14 '15 at 14:47
  • The "periods showing up" are not a problem, they're normal. They show you that your numbers are floats not ints. In a Python list it prints by default with commas between elements. The dots-vs-commas has no impact on what sort of data is inside, it's just a display thing. – John Zwinck Nov 14 '15 at 14:58
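As a concrete sketch of the list-then-stack pattern suggested in the comments, collect the vectors in a plain Python list and call vstack once after the loop (the loop body below is a hypothetical stand-in for however S is really computed):

```python
import numpy as np

# Collect each per-iteration vector S in a plain Python list...
vectors = []
for k in range(4):
    S = np.array([float(k % 2), 4.0, -2.0])  # hypothetical stand-in for the real S
    vectors.append(S)

# ...then stack once, after the loop, instead of calling vstack on every pass.
Corr = np.vstack(vectors)
print(Corr.shape)  # (4, 3)
```

This avoids copying the growing array on every iteration, which matters when the number of vectors gets large.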

3 Answers


If you convert your vectors to tuples, you can put them in a set which will automatically discard duplicates. For example:

unique_vectors = set(map(tuple, Corr))

array_of_unique_vectors = np.array(list(unique_vectors))
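For example, applied to a small array with a duplicated row (note that row order in the result is arbitrary, since sets are unordered):

```python
import numpy as np

Corr = np.array([[0., 1., 2.],
                 [0., 1., 2.],   # duplicate of row 0
                 [3., 4., 5.]])

# Tuples are hashable, so a set discards duplicate rows automatically.
unique_vectors = set(map(tuple, Corr))
array_of_unique_vectors = np.array(list(unique_vectors))
print(array_of_unique_vectors.shape)  # (2, 3)
```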

Edit: I was curious, so I quickly benchmarked the three proposed solutions here. The results are the same up to the order of the returned elements, and it appears that the Pandas drop_duplicates method outperforms the others.

import numpy as np
import pandas as pd

def unique_set(a):
    return np.vstack(set(map(tuple, a)))

def unique_numpy(a):
    a = np.ascontiguousarray(a)
    view = a.view(np.dtype(('void', a.itemsize * a.shape[1])))
    unique = np.unique(view)
    return unique.view(a.dtype).reshape(-1, a.shape[1])

def unique_pandas(a):
    return pd.DataFrame(a).drop_duplicates().values

a = np.random.randint(0, 5, (100000, 5))

%timeit unique_set(a)
10 loops, best of 3: 183 ms per loop

%timeit unique_numpy(a)
10 loops, best of 3: 43.1 ms per loop

%timeit unique_pandas(a)
100 loops, best of 3: 10.3 ms per loop
jakevdp
  • This only works if the data is not very large and you don't care about optimal performance. – John Zwinck Nov 14 '15 at 14:15
  • Actually, I think it's pretty close to optimal. What you want is a hash table to do comparisons of new vectors in O[1], and ``set()`` is the easiest way to build a hash table to drop conflicts. In Python 3, ``map()`` is a generator, so you're not going to face huge memory issues either. In Python 2 ``itertools.imap`` would be better. – jakevdp Nov 14 '15 at 14:19
  • Are you a regular user of NumPy? Are you familiar with how it works and why people use it? – John Zwinck Nov 14 '15 at 14:20
  • Yes and yes. If there is a vectorized way to compute a hash table within numpy, I'm not aware of it. – jakevdp Nov 14 '15 at 14:22
  • My data will be blowing up as I go to higher and higher numbers. Is there another optimal solution for this? – user3625380 Nov 14 '15 at 14:48
  • See the other solution I just posted, which avoids data duplication. – jakevdp Nov 14 '15 at 14:49
  • This is fantastic, though, I just used it. First I converted Corr to a list using np.array(Corr).tolist(), then used your suggestion jakevdp. Thank you so much. How would you suggest I use it for much bigger data sets, John? – user3625380 Nov 14 '15 at 14:53
  • Holy hell jakevdp, you are my hero. Thank you. – user3625380 Nov 14 '15 at 16:52

Here's an answer that avoids data duplication and doesn't require external packages like Pandas:

Corr = np.ascontiguousarray(Corr)
view = Corr.view(np.dtype(('void', Corr.itemsize * Corr.shape[1])))
unique_view = np.unique(view)
unique = unique_view.view(Corr.dtype).reshape(-1, Corr.shape[1])

I find it to be about 5 times faster than the set-of-tuple solution I previously proposed.
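For illustration, here is the same void-view trick run end to end on a small hypothetical array (the view makes each row a single opaque element, so np.unique can deduplicate whole rows at once):

```python
import numpy as np

# Small hypothetical array with one duplicated row.
Corr = np.array([[0., 1., 2.],
                 [3., 4., 5.],
                 [0., 1., 2.]])

Corr = np.ascontiguousarray(Corr)
# View each row as one opaque "void" element of row-sized bytes.
view = Corr.view(np.dtype((np.void, Corr.dtype.itemsize * Corr.shape[1])))
# np.unique now compares whole rows; view back and restore the row shape.
unique = np.unique(view).view(Corr.dtype).reshape(-1, Corr.shape[1])
print(unique.shape)  # (2, 3)
```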

jakevdp

Pandas has a fairly direct solution to your problem--the drop_duplicates function:

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop_duplicates.html
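A minimal sketch of that approach, on a small hypothetical array (drop_duplicates keeps the first occurrence of each row, and .values converts back to a NumPy array):

```python
import numpy as np
import pandas as pd

# Hypothetical small array with a duplicated row.
Corr = np.array([[0., 1., 2.],
                 [0., 1., 2.],
                 [3., 4., 5.]])

# Wrap in a DataFrame, drop duplicate rows, convert back to an ndarray.
unique = pd.DataFrame(Corr).drop_duplicates().values
print(unique.shape)  # (2, 3)
```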

John Zwinck