The Problem

I'm trying to count how often each string occurs in a list and then sort the results by count in descending order. scipy.stats.itemfreq produces the frequencies, but it returns them as a numpy array in which every element, including the counts, is a string. This is where I'm stumped. How do I sort it?

So far I have tried operator.itemgetter, which appeared to work for a small list until I realised that it sorts the counts as strings, character by character, rather than converting them to integers. As a result '5' > '11', because it compares '5' with '1' rather than 5 with 11.

I'm using Python 2.7, numpy 1.8.1 and scipy 0.14.0.

Example Code:

from scipy.stats import itemfreq
import operator as op

items = ['platypus duck','platypus duck','platypus duck','platypus duck','cat','dog','platypus duck','elephant','cat','cat','dog','bird','','','cat','dog','bird','cat','cat','cat','cat','cat','cat','cat']
items = itemfreq(items)
items = sorted(items, key=op.itemgetter(1), reverse=True)
print items
print items[0]

Output:

[array(['platypus duck', '5'], 
      dtype='|S13'), array(['dog', '3'], 
      dtype='|S13'), array(['', '2'], 
      dtype='|S13'), array(['bird', '2'], 
      dtype='|S13'), array(['cat', '11'], 
      dtype='|S13'), array(['elephant', '1'], 
      dtype='|S13')]
['platypus duck' '5']

Expected Output:

I'm after the ordering so something like:

[array(['cat', '11'], 
      dtype='|S13'), array(['platypus duck', '5'], 
      dtype='|S13'), array(['dog', '3'], 
      dtype='|S13'), array(['', '2'], 
      dtype='|S13'), array(['bird', '2'], 
      dtype='|S13'), array(['elephant', '1'], 
      dtype='|S13')]
['cat', '11']

Summary

My question is: how do I sort the array (which in this case is a string array) in descending order of counts? Please feel free to suggest alternative and faster/improved methods to my code sample above.

Timballisto

2 Answers

It is unfortunate that itemfreq returns the unique items and their counts in the same array. For your case, it means the counts are converted to strings, which is just dumb.
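If you have to work with the string array that itemfreq has already produced, one possible workaround is to convert the count column back to integers before sorting. The following is only a minimal sketch: the array named table is typed in by hand to mimic the itemfreq result from the question (it does not call scipy), and kind='stable' assumes a reasonably recent numpy.

```python
import numpy as np

# Hand-written stand-in for the itemfreq output in the question:
# every element, including the counts, is a string.
table = np.array([
    ['', '2'],
    ['bird', '2'],
    ['cat', '11'],
    ['dog', '3'],
    ['elephant', '1'],
    ['platypus duck', '5'],
])

# Convert the count column back to integers so 11 compares greater than 5.
counts = table[:, 1].astype(int)

# Sort the negated counts with a stable sort to get descending order
# while leaving tied rows in their original order.
order = np.argsort(-counts, kind='stable')
sorted_table = table[order]

print(sorted_table)
```

The first row of sorted_table is then the 'cat' row with count '11', which is the ordering asked for in the question.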

If you can upgrade numpy to version 1.9 or later, then instead of using itemfreq, you can use numpy.unique with the argument return_counts=True, which returns the counts as a separate integer array (see below for how to accomplish this in older numpy):

In [29]: items = ['platypus duck','platypus duck','platypus duck','platypus duck','cat','dog','platypus duck','elephant','cat','cat','dog','bird','','','cat','dog','bird','cat','cat','cat','cat','cat','cat','cat']

In [30]: values, counts = np.unique(items, return_counts=True)

In [31]: values
Out[31]: 
array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'], 
      dtype='|S13')

In [32]: counts
Out[32]: array([ 2,  2, 11,  3,  1,  5])

Get the indices that put counts in decreasing order:

In [38]: idx = np.argsort(counts)[::-1]

In [39]: values[idx]
Out[39]: 
array(['cat', 'platypus duck', 'dog', 'bird', '', 'elephant'], 
      dtype='|S13')

In [40]: counts[idx]
Out[40]: array([11,  5,  3,  2,  2,  1])
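Note that reversing the argsort result also reverses the order of items with equal counts, which is why 'bird' comes before '' in the output above. If the order of ties matters, one option is to sort the negated counts with a stable sort instead. This is a minimal sketch, assuming a numpy version whose argsort accepts kind='stable'; the values and counts arrays are copied from the session above.

```python
import numpy as np

# Same values and counts as returned by np.unique above.
values = np.array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'])
counts = np.array([2, 2, 11, 3, 1, 5])

# Negating the counts turns an ascending sort into a descending one,
# and a stable sort keeps tied items in their original (alphabetical) order.
idx = np.argsort(-counts, kind='stable')

print(values[idx])
print(counts[idx])
```

With this ordering the two items with count 2 appear as '' followed by 'bird', matching the order requested in the question.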

For older versions of numpy, you can combine np.unique and np.bincount, as follows:

In [46]: values, inv = np.unique(items, return_inverse=True)

In [47]: counts = np.bincount(inv)

In [48]: values
Out[48]: 
array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'], 
      dtype='|S13')

In [49]: counts
Out[49]: array([ 2,  2, 11,  3,  1,  5])

In [50]: idx = np.argsort(counts)[::-1]

In [51]: values[idx]
Out[51]: 
array(['cat', 'platypus duck', 'dog', 'bird', '', 'elephant'], 
      dtype='|S13')

In [52]: counts[idx]
Out[52]: array([11,  5,  3,  2,  2,  1])

In fact, the above is exactly what itemfreq does. Here's the definition of itemfreq in the scipy source code (without the docstring):

def itemfreq(a):
    items, inv = np.unique(a, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T
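
Since itemfreq is only those three lines, a variant that keeps the counts as integers and sorts them is easy to sketch. The function name sorted_item_counts is hypothetical (it is not part of scipy or numpy), and kind='stable' again assumes a reasonably recent numpy.

```python
import numpy as np

def sorted_item_counts(a):
    """Return (items, counts) sorted by descending count.

    Hypothetical helper, modelled on itemfreq but returning two arrays
    so that the counts stay integers.
    """
    items, inv = np.unique(a, return_inverse=True)
    # ravel() guards against a non-flat inverse array before counting.
    freq = np.bincount(inv.ravel())
    # Stable sort on negated counts: descending order, ties keep their order.
    order = np.argsort(-freq, kind='stable')
    return items[order], freq[order]

items = ['platypus duck', 'platypus duck', 'platypus duck', 'platypus duck',
         'cat', 'dog', 'platypus duck', 'elephant', 'cat', 'cat', 'dog',
         'bird', '', '', 'cat', 'dog', 'bird', 'cat', 'cat', 'cat', 'cat',
         'cat', 'cat', 'cat']

values, counts = sorted_item_counts(items)
print(values)
print(counts)
```

Returning two arrays rather than one stacked array is the design choice that avoids the string conversion in the first place.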
Warren Weckesser

A much simpler way to get the frequency of each item, with the items already sorted by frequency, is to use the pandas function value_counts (for the original post and more suggestions see here):

import pandas as pd
import numpy as np
x = np.array(["bird","cat","dog","dog","cat","cat"])
pd.value_counts(x)

cat     3
dog     2
bird    1
dtype: int64

Getting only the number of occurrences, sorted:

y = pd.value_counts(x).values

array([3, 2, 1])

Getting only the unique names of the items you want to count, sorted:

z = pd.value_counts(x).index

Index(['cat', 'dog', 'bird'], dtype='object')
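
As far as I know, the top-level pd.value_counts function is deprecated in recent pandas releases in favour of the Series method. A minimal sketch of the same steps using Series.value_counts, assuming pandas and numpy are installed, could look like this:

```python
import numpy as np
import pandas as pd

x = np.array(["bird", "cat", "dog", "dog", "cat", "cat"])

# Wrap the array in a Series and count; the result is sorted by
# descending count by default.
counts = pd.Series(x).value_counts()

names = counts.index.tolist()     # unique items, most frequent first
numbers = counts.values.tolist()  # matching counts as plain integers

print(names)
print(numbers)
```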
NeStack