4

I wasn't sure how to title this.

Assume the following Pandas DataFrame:

    Student ID      Class   
1   John    99124   Biology
2   John    99124   History
3   John    99124   Geometry
4   Sarah   74323   Physics
5   Sarah   74323   Geography
6   Sarah   74323   Algebra
7   Alex    80045   Trigonometry
8   Alex    80045   Economics
9   Alex    80045   French

I'd like to reduce the number of rows in this DataFrame by creating a list of classes that each student is taking, and then putting that in the "class" column. Here's my desired output:

    Student ID      Class
1   John    99124   ["Biology","History","Geometry"]
2   Sarah   74323   ["Physics","Geography","Algebra"]
3   Alex    80045   ["Trigonometry","Economics","French"]

I am working with a large DataFrame that is not as nicely organized as this example. Any help is appreciated.

superfeen
  • 83
  • 6
  • 2
    I'm not sure a list is really the ideal solution. Storing objects in a DataFrame just creates an obstacle for most future manipulations. For instance, it's no longer trivial to check which students have taken `Biology`, or at least not without a much less performant method. – ALollz Aug 02 '19 at 19:29
  • 1
    @ALollz do you recommend an alternative approach? – superfeen Aug 02 '19 at 19:43
  • 1
    It depends what kind of manipulations you need in the end. Having a long format where each row is a unique key (as in the above) can be useful and is flexible. It's not difficult to `pivot`, or get dummies from there, which might be more suitable for other things. – ALollz Aug 02 '19 at 19:58

4 Answers4

2

You need to groupby on Student and ID and then use agg.

df.groupby(['Student', 'ID'], as_index=False).agg({'Class': list})

Ouput:

  Student     ID                              Class
0    Alex  80045  [Trigonometry, Economics, French]
1    John  99124       [Biology, History, Geometry]
2   Sarah  74323      [Physics, Geography, Algebra]
harvpan
  • 8,571
  • 2
  • 18
  • 36
1
df.groupby('ID')['Class'].apply(list)
Carsten
  • 2,765
  • 1
  • 13
  • 28
1

let's see, using some help Apply multiple functions to multiple groupby columns

you could write something like

df= df.groupby('student').agg({'id':'max', 'Class': lambda x: x.tolist()})

hope it helps, giulio

giulio
  • 157
  • 8
1

try like below

df.groupby(['Student', 'ID'],as_index=False).agg(lambda x:','.join('"'+x+'"'))
Zaynul Abadin Tuhin
  • 31,407
  • 5
  • 33
  • 63