Groupby in a list for python

Question

Given a large dataset of one million records, I am looking for ways to do a group by. I am new to python, but i know in SQL there's a groupby function and i am guessing it might be applicable.

What i want to achieve is this,

From

["A", 4]
["B", 4]
["F", 3]
["A", 4]
["B", 1]

To

["A", (4,4)]
["B", (1,4)]
["F", (3)]

I am also looking for an efficient way to calculate the average of the list of ratings. So finally the output should be:

["A", 4]
["B", 2.5]
["F", 3]

I've tried to do a iterative approach to it but the error thrown was "there was too much data to unpack". Here is my solution which is not workng for the dataset.

len = max(key for (item, key) in results)
newList = [[] for i in range(len+1)]
for item, key in results: 
    newList[key].append(item)

I am looking for efficient way to do it, is there a way to do a groupby in list comprehension? Thanks!

The error means that your dataset **is not** of the form `[(x, y), ...]`. Are you sure that `results` is an iterable of `(x, y)` pairs? — freakish, Jan 20 '16 at 17:14

score 3 · Answer 1 · edited May 23 '17 at 12:30

I think the following would be a reasonable approach for a small dataset.

from collections import defaultdict

ls = [
    ["A", 4],
    ["B", 4],
    ["F", 3],
    ["A", 4],
    ["B", 1],
]

d = defaultdict(list)
for key, val in ls:
        d[key].append(val)

# Prints [['A', (4, 4)], ['B', (4, 1)], ['F', (3,)]]
print [[k, tuple(l)] for k, l in d.items()]   

# prints [['A', 4.0], ['B', 2.5], ['F', 3.0]]  
print [[k, float(sum(l))/len(l)] for k, l in d.items()] #*

*In Python 2.x, use iteritems() instead of items(), see this answer.

Slightly better, if you only care about the averages, you won't need to store everything that a key maps to:

d = defaultdict(lambda: (0, 0))
for key, val in ls:
    cnt, total = d[key]
    d[key] = (cnt + 1, total + val)

print [[k, float(total) / cnt] for k, (cnt, total) in d.items()]

Josep Valls · Accepted Answer · 2016-01-20T17:55:15.133

There is indeed a groupby method in itertools, just be aware that it requires the data to be sorted beforehand, see the documentation here https://docs.python.org/2/library/itertools.html#itertools.groupby

But from the code you posted, it looks like you don't really need to group, you just want to count, right? Then you may better use collections.Counter. Note that it requires the items to be hashable so you'd want to convert those lists into tuples.

>>> lst = [tuple(i) for i in ls]
>>> collections.Counter(lst)
Counter({('A', 4): 2, ('F', 3): 1, ('B', 1): 1, ('B', 4): 1})

Regarding efficiency... Not sure you'd fare very well loading the whole dataset in memory but you could use the defaultdict approach described by Vlad with an iterator.

About the averages, if you really want to use groupby then you could do something like this:

>>> def average(lst):
...     return 1.0*sum(lst)/len(lst) if lst else 0.0
>>> [(i[0],average([j[1] for j in i[1]])) for i in itertools.groupby(sorted(ls),key=lambda i:i[0])]
[('A', 4.0), ('B', 2.5), ('F', 3.0)]

score 0 · Answer 3 · answered Jan 20 '16 at 18:22

You may want to get used to dealing with a spreadsheet type interface for data like this. This is a larger implementation than you asked for but then graphing and charting will be easier in the long run. This example uses pandas and numpy.

Highlight the data from this question and copy it:

name value
A 4
B 4
F 3
A 4
B 1

You can get into ipython and start typing the setup for this operation.

import pandas as pd
import numpy as np

data= pd.from_clipboard()

Now comes the interesting part. you can use a pivot table which can group all those values by any function you want.

pd.pivot_table(data=data, index='name', aggfunc=np.mean)

which returns

      value
name       
A       4.0
B       2.5
F       3.0

Groupby in a list for python

3 Answers3