1

I have a Pandas dataframe:

test=pd.DataFrame(columns=['GroupID','Sample','SampleMeta','Value'])
test.loc[0,:]='1','S1','S1_meta',1
test.loc[1,:]='1','S1','S1_meta',1
test.loc[2,:]='2','S2','S2_meta',1

I want to (1) group by two columns ('GroupID' and 'Sample'), (2) sum 'Value' per group, and (3) retain only unique values in 'SampleMeta' per group. The desired result ('GroupID' and 'Sample' as index) is shown:

                SampleMeta  Value
GroupID Sample                       
1       S1      S1_meta      2
2       S2      S2_meta      1 

df.groupby() and the .sum() method get close, but .sum() concatenates identical values in the 'Values' column within a group. As a result, the 'S1_meta' value is duplicated.

g=test.groupby(['GroupID','Sample'])
print g.sum()

                SampleMeta      Value
GroupID Sample                       
1       S1      S1_metaS1_meta  2
2       S2      S2_meta         1 

Is there a way to achieve the desired result using groupby() and associated methods? Merging the summed 'Value' per group with a separate 'SampleMeta' DataFrame works but there must be a more elegant solution.

lmart999
  • 6,671
  • 10
  • 29
  • 37

1 Answers1

0

Well, you can include SampleMeta as part of the groupby:

print test.groupby(['GroupID','Sample','SampleMeta']).sum()

                           Value
GroupID Sample SampleMeta       
1       S1     S1_meta         2
2       S2     S2_meta         1

If you don't want SampleMeta as part of the index when done you could modify it as follows:

print test.groupby(['GroupID','Sample','SampleMeta']).sum().reset_index(level=2)

               SampleMeta  Value
GroupID Sample                  
1       S1        S1_meta      2
2       S2        S2_meta      1

This will only work right if there is no variation within SampleMeta for ['GroupID','Sample']. Of course, If there was variation within ['GroupID','Sample'] then you probably to exclude SampleMeta from the groupby/sum entirely:

print test.groupby(['GroupID','Sample'])['Value'].sum()

GroupID  Sample
1        S1        2
2        S2        1
Karl D.
  • 13,332
  • 5
  • 56
  • 38
  • Thanks, this works well for the example I gave. For large DataFrames (>100k entries) with several columns that I want to preserve, including these columns in groupby made the operation very slow. So, this strategy may not scale well to large DataFrame with many columns (like `SampleMeta') to preserve. – lmart999 May 14 '14 at 08:37