
I am looking for a function that takes a one-dimensional sorted array and returns a two-dimensional array with two columns: the first column containing the non-repeated items, and the second column containing the number of repetitions of each item. Right now my code is as follows:

import numpy

def priorsGrouper(priors):
    if priors.size==0:
        ret=priors;
    elif priors.size==1:
        ret=priors[0],1;
    else:
        ret=numpy.zeros((1,2));
        pointer1,pointer2=0,0;
        while(pointer1<priors.size):
            counter=0;
            while(pointer2<priors.size and priors[pointer2]==priors[pointer1]):
                counter+=1;
                pointer2+=1;
            ret=numpy.row_stack((ret,[priors[pointer1],pointer2-pointer1]))
            pointer1=pointer2;
    return ret;
print priorsGrouper(numpy.array([1,2,2,3]))    

My output is as follows:

[[ 0.  0.]
 [ 1.  1.]
 [ 2.  2.]
 [ 3.  1.]]

First of all, I cannot get rid of the [0, 0] row. Secondly, I want to know if there is a NumPy or SciPy function for this, or is mine OK?

Thanks.

Cupitor
    If the first column of the result array has "non repeated items", how can the second column have "number of repetitions of the item"? – Mariano Mar 07 '13 at 17:44
  • I want the output to be structured like that. I'll add an example. – Cupitor Mar 07 '13 at 17:47
  • Right. Sorry searched but was not able to find it. Do I have to delete this one? – Cupitor Mar 07 '13 at 18:04
  • No, not at all. It's just a way of saying "Your answer may already be over here." Even closed duplicates can be helpful because they point back to the original. – jscs Mar 07 '13 at 18:55

4 Answers


You could use np.unique to get the unique values in x, as well as an array of indices (called inverse). The inverse can be thought of as "labels" for the elements in x. Unlike x itself, the labels are always integers, starting at 0.

Then you can take a bincount of the labels. Since the labels start at 0, the bincount won't be filled with a lot of zeros that you don't care about.

Finally, column_stack will join y and the bincount into a 2D array:

In [84]: x = np.array([1,2,2,3])

In [85]: y, inverse = np.unique(x, return_inverse=True)

In [86]: y
Out[86]: array([1, 2, 3])

In [87]: inverse
Out[87]: array([0, 1, 1, 2])

In [88]: np.bincount(inverse)
Out[88]: array([1, 2, 1])

In [89]: np.column_stack((y,np.bincount(inverse)))
Out[89]: 
array([[1, 1],
       [2, 2],
       [3, 1]])

Sometimes when an array is small, it turns out that using plain Python methods is faster than NumPy functions. I wanted to check if that was the case here and, if so, how large x would have to be before NumPy methods are faster.

Here is a graph of the performance of the various methods as a function of the size of x:

(figure: timing of using_unique, using_groupby, using_counter, and using_ordered_dict vs. len(x))

In [173]: x = np.random.random(1000)

In [174]: x.sort()

In [156]: %timeit using_unique(x)
10000 loops, best of 3: 99.7 us per loop

In [180]: %timeit using_groupby(x)
100 loops, best of 3: 3.64 ms per loop

In [157]: %timeit using_counter(x)
100 loops, best of 3: 4.31 ms per loop

In [158]: %timeit using_ordered_dict(x)
100 loops, best of 3: 4.7 ms per loop

For len(x) of 1000, using_unique is over 35x faster than any of the plain Python methods tested.

So it looks like using_unique is fastest, even for very small len(x).
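As a side note: newer NumPy versions (1.9 and later) let np.unique return the counts directly via the return_counts keyword, which removes the bincount step entirely. A minimal sketch, assuming your NumPy is recent enough to support that keyword:

```python
import numpy as np

x = np.array([1, 2, 2, 3])

# np.unique can count each unique value directly (NumPy 1.9+)
y, counts = np.unique(x, return_counts=True)

print(np.column_stack((y, counts)))
# [[1 1]
#  [2 2]
#  [3 1]]
```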


Here is the program used to generate the graph:

import numpy as np
import collections
import itertools as IT
import matplotlib.pyplot as plt
import timeit

def using_unique(x):
    y, inverse = np.unique(x, return_inverse=True)
    return np.column_stack((y, np.bincount(inverse)))

def using_counter(x):
    result = collections.Counter(x)
    return np.array(sorted(result.items()))

def using_ordered_dict(x):
    result = collections.OrderedDict()
    for item in x:
        result[item] = result.get(item,0)+1
    return np.array(result.items())

def using_groupby(x):
    return np.array([(k, sum(1 for i in g)) for k, g in IT.groupby(x)])

fig, ax = plt.subplots()
timing = collections.defaultdict(list)
Ns = [int(round(n)) for n in np.logspace(0, 3, 10)]
for n in Ns:
    x = np.random.random(n)
    x.sort()
    timing['unique'].append(
        timeit.timeit('m.using_unique(m.x)', 'import __main__ as m', number=1000))
    timing['counter'].append(
        timeit.timeit('m.using_counter(m.x)', 'import __main__ as m', number=1000))
    timing['ordered_dict'].append(
        timeit.timeit('m.using_ordered_dict(m.x)', 'import __main__ as m', number=1000))
    timing['groupby'].append(
        timeit.timeit('m.using_groupby(m.x)', 'import __main__ as m', number=1000))

ax.plot(Ns, timing['unique'], label='using_unique')
ax.plot(Ns, timing['counter'], label='using_counter')
ax.plot(Ns, timing['ordered_dict'], label='using_ordered_dict')
ax.plot(Ns, timing['groupby'], label='using_groupby')
plt.legend(loc='best')
plt.ylabel('milliseconds')
plt.xlabel('size of x')
plt.show()
unutbu

If order is not important, use Counter.

>>> from collections import Counter
>>> Counter([1,2,2,3])
Counter({2: 2, 1: 1, 3: 1})
>>> Counter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]

To preserve order (by first appearance), you can implement your own version of Counter:

>>> from collections import OrderedDict
>>> def OrderedCounter(seq):
...     res = OrderedDict()
...     for x in seq:
...         res.setdefault(x, 0)
...         res[x] += 1
...     return res
...
>>> OrderedCounter([1,2,2,3])
OrderedDict([(1, 1), (2, 2), (3, 1)])
>>> OrderedCounter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]
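Either way, the (value, count) pairs convert directly into the two-column array the question asks for; a small sketch, assuming NumPy is available:

```python
import numpy as np
from collections import Counter

# Sort the (value, count) pairs so the rows come out in value order
pairs = sorted(Counter([1, 2, 2, 3]).items())
arr = np.array(pairs)

print(arr)
# [[1 1]
#  [2 2]
#  [3 1]]
```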
shx2

If you want to count repetitions of an item you can use a dictionary:

l = [1, 2, 2, 3]
d = {}
for i in l:
    if i not in d:
        d[i] = 1
    else:
        d[i] += 1
result = [[k, v] for k, v in d.items()]

For your example, it returns:

[[1, 1],
 [2, 2], 
 [3, 1]]
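The same counting loop can be written a bit more compactly with collections.defaultdict, which supplies the initial zero automatically (sorting the result here just makes the row order deterministic):

```python
from collections import defaultdict

l = [1, 2, 2, 3]
d = defaultdict(int)  # missing keys start at 0
for i in l:
    d[i] += 1

result = sorted([k, v] for k, v in d.items())
print(result)  # [[1, 1], [2, 2], [3, 1]]
```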

Good luck.

Mariano

First of all, you don't need to end your statements with semicolons (;), this isn't C. :-)

Second, the single-element branch (ret=priors[0],1) sets ret to value,value, but that creates a tuple, not a list:

>type foo.py
def foo():
        return [1],2
a,b = foo()
print "a = {0}".format(a)
print "b = {0}".format(b)

Gives:

>python foo.py
a = [1]
b = 2

Third: there are easier ways to do this, here's what comes to mind:

  • Use the set() constructor to create a collection of the unique items
  • Create a list of the number of times each entry in that set occurs in the input list
  • Use zip() to combine the two lists into a list of tuples (although this isn't exactly what you were asking for)

Here's one way:

def priorsGrouper(priors):
    """Find out how many times each element occurs in a list.

    @param[in] priors List of elements
    @return List of (element, occurrence count) tuples.
    """

    # Generate a `list' containing only unique elements from the input
    mySet = set(priors)

    # Create the list that will store the number of occurrences
    occurrenceCounts = []

    # Count how many times each element occurs on the input:
    for element in mySet:
        occurrenceCounts.append(priors.count(element))

    # Combine the two:
    combinedArray = zip(mySet, occurrenceCounts)
    return combinedArray
# End of priorsGrouper() ----------------------------------------------

# Check zero-element case
print priorsGrouper([])

# Check multi-element case
sampleInput = ['a','a', 'b', 'c', 'c', 'c']
print priorsGrouper(sampleInput)
neilr8133