0

I've got a pandas dataframe which I am using the groupby() function on to group things the way I want, except pandas is skipping repeated values, only showing unique values instead.

Here is a sample dataframe

data = [ 
    ['American Mathematical Society', 'Journal', 2, 'Mathematics & Statistics'],
    ['American Mathematical Society', 'Journal', 2, 'Mathematics & Statistics'],
    ['American Mathematical Society', 'Journal', 38, 'Mathematics & Statistics'],
    ['American Mathematical Society', 'Journal', 4, 'Mathematics & Statistics']]

df = pd.DataFrame(data, columns = ['Provider', 'Type', 'Downloads JR1 2017', 'Field'])

Now I use the groupby function to group these the way I like in a list.

jr1_provider = df.groupby(['Provider', 'Field', 'Downloads JR1 2017'], as_index=False).sum().values.tolist()

Here is the output:

[['American Mathematical Society', 'Mathematics & Statistics', 2, 'JournalJournal'], ['American Mathematical Society', 'Mathematics & Statistics', 4, 'Journal'], ['American Mathematical Society', 'Mathematics & Statistics', 38, 'Journal']]

However, there should be 4 items in the output. Instead I have only 3. I see that duplicate values have been removed from the results because two of the rows have value '2' in the 'Downloads JR1 2017' column.

Why? And how can I get all results returned?

The output I want to get to would be the name of the 'provider', with a sum of the 'Downloads JR1 2017'. Example:

['American Mathematical Society', 46]
Erich Purpur
  • 1,337
  • 3
  • 14
  • 34
  • 1
    For pandas questions, it always helps to see sample input and output. See [creating good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Jun 21 '19 at 16:19
  • I corrected my question with a sample input and output – Erich Purpur Jun 21 '19 at 17:34
  • 1
    It grouped the 'Downloads JR1 2017' column by values because you included it in the groupby columns. If you don't include it, then it won't be included in the grouping. – G. Anderson Jun 21 '19 at 17:46
  • Ok, I see that now. Thanks. But I am ultimately trying to sum all the values in the 'Downloads JR1 2017' column. So it looks like I am on the wrong track... – Erich Purpur Jun 21 '19 at 17:54

2 Answers2

3

So you can check transform

jr1_provider = provider_subset.groupby(['Provider', 'Field', 'Downloads JR1 2017'], as_index=False).transform('sum').values.tolist()
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I am now getting a value error: Length mismatch: Expected axis has 13 elements, new values have 14 elements – Erich Purpur Jun 21 '19 at 17:45
  • @ErichPurpur maybe show us some data , I am not sure why you have 14 row rather than 13 , with transform , you should produced the same length result as df – BENY Jun 21 '19 at 17:59
2

Based on your additional detail in the comments, how about

df.groupby(['Provider', 'Field'], as_index=False).sum()
G. Anderson
  • 5,815
  • 2
  • 14
  • 21