Remove elements that appear more often than once from numpy array

Question

The question is, how can I remove elements that appear more often than once in an array completely. Below you see an approach that is very slow when it comes to bigger arrays. Any idea of doing this the numpy-way? Thanks in advance.

import numpy as np

count = 0
result = []
input = np.array([[1,1], [1,1], [2,3], [4,5], [1,1]]) # array with points [x, y]

# count appearance of elements with same x and y coordinate
# append to result if element appears just once

for i in input:
    for j in input:
        if (j[0] == i [0]) and (j[1] == i[1]):
            count += 1
    if count == 1:
        result.append(i)
    count = 0

print np.array(result)

UPDATE: BECAUSE OF FORMER OVERSIMPLIFICATION

Again to be clear: How can I remove elements appearing more than once concerning a certain attribute from an array/list ?? Here: list with elements of length 6, if first and second entry of every elements both appears more than once in the list, remove all concerning elements from list. Hope I'm not to confusing. Eumiro helped me a lot on this, but I don't manage to flatten the output list as it should be :(

import numpy as np 
import collections

input = [[1,1,3,5,6,6],[1,1,4,4,5,6],[1,3,4,5,6,7],[3,4,6,7,7,6],[1,1,4,6,88,7],[3,3,3,3,3,3],[456,6,5,343,435,5]]

# here, from input there should be removed input[0], input[1] and input[4] because
# first and second entry appears more than once in the list, got it? :)

d = {}

for a in input:
    d.setdefault(tuple(a[:2]), []).append(a[2:])

outputDict = [list(k)+list(v) for k,v in d.iteritems() if len(v) == 1 ]

result = []

def flatten(x):
    if isinstance(x, collections.Iterable):
        return [a for i in x for a in flatten(i)]
    else:
        return [x]

# I took flatten(x) from http://stackoverflow.com/a/2158522/1132378
# And I need it, because output is a nested list :(

for i in outputDict:
    result.append(flatten(i))

print np.array(result)

So, this works, but it's impracticable with big lists. First I got RuntimeError: maximum recursion depth exceeded in cmp and after applying sys.setrecursionlimit(10000) I got Segmentation fault how could I implement Eumiros solution for big lists > 100000 elements?

eumiro · Accepted Answer · 2012-02-04T13:41:31.027

3

np.array(list(set(map(tuple, input))))

returns

array([[4, 5],
       [2, 3],
       [1, 1]])

UPDATE 1: If you want to remove the [1, 1] too (because it appears more than once), you can do:

from collections import Counter

np.array([k for k, v in Counter(map(tuple, input)).iteritems() if v == 1])

returns

array([[4, 5],
       [2, 3]])

UPDATE 2: with input=[[1,1,2], [1,1,3], [2,3,4], [4,5,5], [1,1,7]]:

input=[[1,1,2], [1,1,3], [2,3,4], [4,5,5], [1,1,7]]

d = {}
for a in input:
    d.setdefault(tuple(a[:2]), []).append(a[2])

d is now:

{(1, 1): [2, 3, 7],
 (2, 3): [4],
 (4, 5): [5]}

so we want to take all key-value pairs, that have single values and re-create the arrays:

np.array([k+tuple(v) for k,v in d.iteritems() if len(v) == 1])

returns:

array([[4, 5, 5],
       [2, 3, 4]])

UPDATE 3: For larger arrays, you can adapt my previous solution to:

import numpy as np
input = [[1,1,3,5,6,6],[1,1,4,4,5,6],[1,3,4,5,6,7],[3,4,6,7,7,6],[1,1,4,6,88,7],[3,3,3,3,3,3],[456,6,5,343,435,5]]
d = {}
for a in input:
    d.setdefault(tuple(a[:2]), []).append(a)
np.array([v for v in d.itervalues() if len(v) == 1])

returns:

array([[[456,   6,   5, 343, 435,   5]],
       [[  1,   3,   4,   5,   6,   7]],
       [[  3,   4,   6,   7,   7,   6]],
       [[  3,   3,   3,   3,   3,   3]]])

edited Feb 04 '12 at 13:41

answered Feb 03 '12 at 12:10

eumiro

207,213
34
299
261

Thank you. My problem is (besides understanding the functions you were using :), I'd like to remove ALL elements, all [1, 1]'s in this example. That's because there is more known than x- and y-coordinates in my real problem and I don't want to risk to prefer the wrong point, so I decides to leave non of the duplicates. Hope that wasn't too confusing :) – feinmann Feb 03 '12 at 12:16
@feinmann - see my updated answer. everything is standard library, so you can find the explanation of different functions in Python doc. – eumiro Feb 03 '12 at 12:21
Thank you very much eumiro. Well done! But I'm afraid I oversimplified my problem. Could you think of a solution if there is a third component (actually there are six components), that should not affect the "remove-descission". like: input=[[1,1,2], [1,1,3], [2,3,4], [4,5,5], [1,1,7]] should result=[[2,3,4],[4,5,5]]. Hope you're not angry on me :) – feinmann Feb 03 '12 at 12:28
for input = [[1,1,3,5,6,6],[1,1,4,4,5,6],[1,3,4,5,6,7],[3,4,6,7,7,6],[1,1,4,6,88,7],[3,3,3,3,3,3],[456,6,5,343,435,5]] I get result=[[456, 6, [5, 343, 435, 5]], [1, 3, [4, 5, 6, 7]], [3, 4, [6, 7, 7, 6]], [3, 3, [3, 3, 3, 3]]] and that is terrible. I dont get to lose the inner brackets :(( I'm sorry eumiro, could you please help me one more time ? thank you so much.... – feinmann Feb 03 '12 at 13:36
The first expression can be shortened to `np.unique(map(tuple, input))`. – Fred Foo Feb 03 '12 at 16:20

score 3 · Answer 2 · answered Feb 03 '12 at 21:51

This is a corrected, faster version of Hooked's answer. count_unique counts the number of the number of occurrences for each unique key in keys.

import numpy as np
input = np.array([[1,1,3,5,6,6],
                  [1,1,4,4,5,6],
                  [1,3,4,5,6,7],
                  [3,4,6,7,7,6],
                  [1,1,4,6,88,7],
                  [3,3,3,3,3,3],
                  [456,6,5,343,435,5]])

def count_unique(keys):
    """Finds an index to each unique key (row) in keys and counts the number of
    occurrences for each key"""
    order = np.lexsort(keys.T)
    keys = keys[order]
    diff = np.ones(len(keys)+1, 'bool')
    diff[1:-1] = (keys[1:] != keys[:-1]).any(-1)
    count = np.where(diff)[0]
    count = count[1:] - count[:-1]
    ind = order[diff[1:]]
    return ind, count

key = input[:, :2]
ind, count = count_unique(key)
print key[ind]
#[[  1   1]
# [  1   3]
# [  3   3]
# [  3   4]
# [456   6]]
print count
[3 1 1 1 1]

ind = ind[count == 1]
output = input[ind]
print output
#[[  1   3   4   5   6   7]
# [  3   3   3   3   3   3]
# [  3   4   6   7   7   6]
# [456   6   5 343 435   5]]

thanks for the heads up on lexsort, it's always nice learn something when your answer is improved upon! — Hooked, Feb 03 '12 at 22:17

Hooked · Answer 3 · 2012-02-03T16:25:59.093

1

Updated Solution:

From the comments below, the new solution is:

idx = argsort(A[:, 0:2], axis=0)[:,1]
kidx = where(sum(A[idx,:][:-1,0:2]!=A[idx,:][1:,0:2], axis=1)==0)[0]
kidx = unique(concatenate((kidx,kidx+1)))

for n in arange(0,A.shape[0],1):
    if n not in kidx:
        print A[idx,:][n]

 > [1 3 4 5 6 7]
   [3 3 3 3 3 3]
   [3 4 6 7 7 6]
   [456   6   5 343 435   5]

kidx is a index list of the elements you don't want. This preserves rows where the first two inner elements do not match any other inner element. Since everything is done with indexing, it should be fast(ish), though it requires a sort on the first two elements. Note that original row order is not preserved, though I don't think this is a problem.

Old Solution:

If I understand it correctly, you simply want to filter out the results of a list of lists where the first element of each inner list is equal to the second element.

With your input from your update A=[[1,1,3,5,6,6],[1,1,4,4,5,6],[1,3,4,5,6,7],[3,4,6,7,7,6],[1,1,4,6,88,7],[3,3,3,3,3,3],[456,6,5,343,435,5]], the following line removes A[0],A[1] and A[4]. A[5] is also removed since that seems to match your criteria.

[x for x in A if x[0]!=x[1]]

If you can use numpy, there is a really slick way of doing the above. Assume that A is an array, then

A[A[0,:] == A[1,:]]

Will pull out the same values. This is probably faster than the solution listed above if you want to loop over it.

edited Feb 03 '12 at 16:25

answered Feb 03 '12 at 14:55

Hooked

84,485
43
192
261

Sorry, you got me wrong unfortunately. What I meant was not if the first is equal to the second. I am afraid, my example satisfies this condition. Elements should be removed if their first AND second entries appear more than once in the list (but exactly in this combination)... so there is a list=[[x1,x2,x3,x4,x5,x6],[y1,y2,y3,y4,y5,y6],...]. Remove elements both if x1==y1 AND x2==y2. – feinmann Feb 03 '12 at 15:08
@feinmann, I think the confusion here is that there are _two_ lists, an outer and inner and we are mixing them up. To be clear, if the first two inner elements of each outer list element match with any other - then they are to be removed? If so, I'll edit my answer accordingly. – Hooked Feb 03 '12 at 15:20
you are absolutly right. if the first two inner elements of every outer element appear in the whole list more than once in this combination, remove all outer elements concerning. many thanks in advance. I tried for myself just to count the appearance and only take those outer elements being in the list once (concerning their first two inner elements).. but it is unacceptable slow :( – feinmann Feb 03 '12 at 15:23
@Hooked, this is close, but argsort isn't doing what you want here. It's sorting each column independently and then you're ignoring the first column and taking the second. You want something more like lexsort. – Bi Rico Feb 03 '12 at 21:46

score -2 · Answer 4 · edited May 19 '15 at 07:40

-2

Why not create another array to hold the output?

Iterate through your main list and for each i check if i is in your other array and if not append it.

This way, your new array will not contain more than one of each element

edited May 19 '15 at 07:40

Stefan van den Akker

6,661
7
48
63

answered May 19 '15 at 06:41

Sankalp Ramesh

1

Remove elements that appear more often than once from numpy array

UPDATE: BECAUSE OF FORMER OVERSIMPLIFICATION

4 Answers4

Updated Solution:

Old Solution: