Python pandas - groupby() skips repeated values in Dataframe

Question

I've got a pandas dataframe which I am using the groupby() function on to group things the way I want, except pandas is skipping repeated values, only showing unique values instead.

Here is a sample dataframe

data = [ 
    ['American Mathematical Society', 'Journal', 2, 'Mathematics & Statistics'],
    ['American Mathematical Society', 'Journal', 2, 'Mathematics & Statistics'],
    ['American Mathematical Society', 'Journal', 38, 'Mathematics & Statistics'],
    ['American Mathematical Society', 'Journal', 4, 'Mathematics & Statistics']]

df = pd.DataFrame(data, columns = ['Provider', 'Type', 'Downloads JR1 2017', 'Field'])

Now I use the groupby function to group these the way I like in a list.

jr1_provider = df.groupby(['Provider', 'Field', 'Downloads JR1 2017'], as_index=False).sum().values.tolist()

Here is the output:

[['American Mathematical Society', 'Mathematics & Statistics', 2, 'JournalJournal'], ['American Mathematical Society', 'Mathematics & Statistics', 4, 'Journal'], ['American Mathematical Society', 'Mathematics & Statistics', 38, 'Journal']]

However, there should be 4 items in the output. Instead I have only 3. I see that duplicate values have been removed from the results because two of the rows have value '2' in the 'Downloads JR1 2017' column.

Why? And how can I get all results returned?

The output I want to get to would be the name of the 'provider', with a sum of the 'Downloads JR1 2017'. Example:

['American Mathematical Society', 46]

For pandas questions, it always helps to see sample input and output. See [creating good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — G. Anderson, Jun 21 '19 at 16:19
It grouped the 'Downloads JR1 2017' column by values because you included it in the groupby columns. If you don't include it, then it won't be included in the grouping. — G. Anderson, Jun 21 '19 at 17:46
Ok, I see that now. Thanks. But I am ultimately trying to sum all the values in the 'Downloads JR1 2017' column. So it looks like I am on the wrong track... — Erich Purpur, Jun 21 '19 at 17:54

score 3 · Answer 1 · answered Jun 21 '19 at 16:00

3

So you can check transform

jr1_provider = provider_subset.groupby(['Provider', 'Field', 'Downloads JR1 2017'], as_index=False).transform('sum').values.tolist()

answered Jun 21 '19 at 16:00

BENY

317,841
20
164
234

I am now getting a value error: Length mismatch: Expected axis has 13 elements, new values have 14 elements – Erich Purpur Jun 21 '19 at 17:45
@ErichPurpur maybe show us some data , I am not sure why you have 14 row rather than 13 , with transform , you should produced the same length result as df – BENY Jun 21 '19 at 17:59

score 2 · Accepted Answer · answered Jun 21 '19 at 18:01

2

Based on your additional detail in the comments, how about

df.groupby(['Provider', 'Field'], as_index=False).sum()

answered Jun 21 '19 at 18:01

G. Anderson

5,815
2
14
21

Python pandas - groupby() skips repeated values in Dataframe

2 Answers2