
I am looking for a function that takes a one-dimensional sorted array and returns a two-dimensional array with two columns: the first column containing the non-repeated items, and the second column containing the number of repetitions of each item. Right now my code is as follows:

import numpy

def priorsGrouper(priors):
    if priors.size==0:
        ret=priors;
    elif priors.size==1:
        ret=priors[0],1;
    else:
        ret=numpy.zeros((1,2));
        pointer1,pointer2=0,0;
        while(pointer1<priors.size):
            counter=0;
            while(pointer2<priors.size and priors[pointer2]==priors[pointer1]):
                counter+=1;
                pointer2+=1;
            ret=numpy.row_stack((ret,[priors[pointer1],pointer2-pointer1]))
            pointer1=pointer2;
    return ret;
print priorsGrouper(numpy.array([1,2,2,3]))    

My output is as follows:

[[ 0.  0.]
 [ 1.  1.]
 [ 2.  2.]
 [ 3.  1.]]

First of all, I cannot get rid of the [0, 0] row. Secondly, I want to know if there is a NumPy or SciPy function for this, or is mine OK?

Thanks.

Cupitor
    If the first column of the result array has "non repeated items", how can the second column have "number of repetitions of the item"? – Mariano Mar 07 '13 at 17:44
  • I want the output to be structured like that. I'll add an example. – Cupitor Mar 07 '13 at 17:47
  • Right. Sorry searched but was not able to find it. Do I have to delete this one? – Cupitor Mar 07 '13 at 18:04
  • No, not at all. It's just a way of saying "Your answer may already be over here." Even closed duplicates can be helpful because they point back to the original. – jscs Mar 07 '13 at 18:55

4 Answers


You could use np.unique to get the unique values in x, as well as an array of indices (called inverse). The inverse can be thought of as "labels" for the elements in x. Unlike x itself, the labels are always integers, starting at 0.

Then you can take a bincount of the labels. Since the labels start at 0, the bincount won't be filled with a lot of zeros that you don't care about.

Finally, column_stack will join y and the bincount into a 2D array:

In [84]: x = np.array([1,2,2,3])

In [85]: y, inverse = np.unique(x, return_inverse=True)

In [86]: y
Out[86]: array([1, 2, 3])

In [87]: inverse
Out[87]: array([0, 1, 1, 2])

In [88]: np.bincount(inverse)
Out[88]: array([1, 2, 1])

In [89]: np.column_stack((y,np.bincount(inverse)))
Out[89]: 
array([[1, 1],
       [2, 2],
       [3, 1]])

Sometimes when an array is small, it turns out that using plain Python methods is faster than NumPy functions. I wanted to check if that was the case here and, if so, how large x would have to be before NumPy methods are faster.

Here is a graph of the performance of the various methods as a function of the size of x:

(figure: timing of using_unique, using_groupby, using_counter, and using_ordered_dict vs. len(x))

In [173]: x = np.random.random(1000)

In [174]: x.sort()

In [156]: %timeit using_unique(x)
10000 loops, best of 3: 99.7 us per loop

In [180]: %timeit using_groupby(x)
100 loops, best of 3: 3.64 ms per loop

In [157]: %timeit using_counter(x)
100 loops, best of 3: 4.31 ms per loop

In [158]: %timeit using_ordered_dict(x)
100 loops, best of 3: 4.7 ms per loop

For len(x) of 1000, using_unique is over 35x faster than any of the plain Python methods tested.

So it looks like using_unique is fastest, even for very small len(x).
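As a side note: newer NumPy versions (1.9 and later) let np.unique return the counts directly via the return_counts keyword, which removes the bincount step entirely. A minimal sketch, assuming your NumPy is recent enough to support that keyword:

```python
import numpy as np

x = np.array([1, 2, 2, 3])

# np.unique can count each unique value directly (NumPy 1.9+)
y, counts = np.unique(x, return_counts=True)

print(np.column_stack((y, counts)))
# [[1 1]
#  [2 2]
#  [3 1]]
```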


Here is the program used to generate the graph:

import numpy as np
import collections
import itertools as IT
import matplotlib.pyplot as plt
import timeit

def using_unique(x):
    y, inverse = np.unique(x, return_inverse=True)
    return np.column_stack((y, np.bincount(inverse)))

def using_counter(x):
    result = collections.Counter(x)
    return np.array(sorted(result.items()))

def using_ordered_dict(x):
    result = collections.OrderedDict()
    for item in x:
        result[item] = result.get(item,0)+1
    return np.array(result.items())

def using_groupby(x):
    return np.array([(k, sum(1 for i in g)) for k, g in IT.groupby(x)])

fig, ax = plt.subplots()
timing = collections.defaultdict(list)
Ns = [int(round(n)) for n in np.logspace(0, 3, 10)]
for n in Ns:
    x = np.random.random(n)
    x.sort()
    timing['unique'].append(
        timeit.timeit('m.using_unique(m.x)', 'import __main__ as m', number=1000))
    timing['counter'].append(
        timeit.timeit('m.using_counter(m.x)', 'import __main__ as m', number=1000))
    timing['ordered_dict'].append(
        timeit.timeit('m.using_ordered_dict(m.x)', 'import __main__ as m', number=1000))
    timing['groupby'].append(
        timeit.timeit('m.using_groupby(m.x)', 'import __main__ as m', number=1000))

ax.plot(Ns, timing['unique'], label='using_unique')
ax.plot(Ns, timing['counter'], label='using_counter')
ax.plot(Ns, timing['ordered_dict'], label='using_ordered_dict')
ax.plot(Ns, timing['groupby'], label='using_groupby')
plt.legend(loc='best')
plt.ylabel('milliseconds')
plt.xlabel('size of x')
plt.show()
unutbu

If order is not important, use Counter.

>>> from collections import Counter
>>> Counter([1,2,2,3])
Counter({2: 2, 1: 1, 3: 1})
>>> Counter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]

To preserve order (by first appearance), you can implement your own version of Counter:

>>> from collections import OrderedDict
>>> def OrderedCounter(seq):
...     res = OrderedDict()
...     for x in seq:
...         res.setdefault(x, 0)
...         res[x] += 1
...     return res
...
>>> OrderedCounter([1,2,2,3])
OrderedDict([(1, 1), (2, 2), (3, 1)])
>>> OrderedCounter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]
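Either way, the (value, count) pairs convert directly into the two-column array the question asks for; a small sketch, assuming NumPy is available:

```python
import numpy as np
from collections import Counter

# Sort the (value, count) pairs so the rows come out in value order
pairs = sorted(Counter([1, 2, 2, 3]).items())
arr = np.array(pairs)

print(arr)
# [[1 1]
#  [2 2]
#  [3 1]]
```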
shx2

If you want to count repetitions of an item you can use a dictionary:

l = [1, 2, 2, 3]
d = {}
for i in l:
    if i not in d:
        d[i] = 1
    else:
        d[i] += 1
result = [[k, v] for k, v in d.items()]

For your example, it returns:

[[1, 1],
 [2, 2], 
 [3, 1]]
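The same counting loop can be written a bit more compactly with collections.defaultdict, which supplies the initial zero automatically (sorting the result here just makes the row order deterministic):

```python
from collections import defaultdict

l = [1, 2, 2, 3]
d = defaultdict(int)  # missing keys start at 0
for i in l:
    d[i] += 1

result = sorted([k, v] for k, v in d.items())
print(result)  # [[1, 1], [2, 2], [3, 1]]
```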

Good luck.

Mariano

First of all, you don't need to end your statements with semicolons (;), this isn't C. :-)

Second, the single-element branch (ret=priors[0],1) sets ret to value,value, but that creates a tuple, not a list:

>type foo.py
def foo():
        return [1],2
a,b = foo()
print "a = {0}".format(a)
print "b = {0}".format(b)

Gives:

>python foo.py
a = [1]
b = 2

Third: there are easier ways to do this, here's what comes to mind:

  • Use the set() constructor to create a collection of the unique items
  • Create a list of the number of times each entry in that set occurs in the input list
  • Use zip() to combine the two lists into a list of tuples (although this isn't exactly what you were asking for)

Here's one way:

def priorsGrouper(priors):
    """Find out how many times each element occurs in a list.

    @param[in] priors List of elements
    @return List of (element, occurrence count) tuples.
    """

    # Generate a `list' containing only unique elements from the input
    mySet = set(priors)

    # Create the list that will store the number of occurrences
    occurrenceCounts = []

    # Count how many times each element occurs on the input:
    for element in mySet:
        occurrenceCounts.append(priors.count(element))

    # Combine the two:
    combinedArray = zip(mySet, occurrenceCounts)
    return combinedArray
# End of priorsGrouper() ----------------------------------------------

# Check zero-element case
print priorsGrouper([])

# Check multi-element case
sampleInput = ['a','a', 'b', 'c', 'c', 'c']
print priorsGrouper(sampleInput)
neilr8133