
I have a pandas DataFrame like this:

cid  code  max  date
1    A     32   date1
1    B     9    date2
1    C     25   date3
2    A     33   date4
2    B     11   date5

For every cid there may be N entries, and N varies per cid: for some it is 1 or 2, for others 3 or more. I want to concatenate all rows that share the same cid into a single row. Some columns will end up empty for cids whose N is lower than that of other cids, and I want to fill those empty columns with -1.
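To gauge how large the wide result would be: it needs one row per cid and (largest N) x (number of value columns) columns, so a single cid with a very large N widens every row. Below is a rough sketch of that check. The helper name `estimate_wide_cells` and the `sample` frame are hypothetical, and the figure only counts cells in the final table, not any temporary memory pandas may need.

```python
import pandas as pd


def estimate_wide_cells(df, key="cid"):
    """Return (rows, columns, cells) of the wide table for a long frame.

    Hypothetical helper: assumes every column except `key` is spread
    across one set of columns per position within the group.
    """
    sizes = df.groupby(key).size()       # N for each cid
    n_rows = len(sizes)                  # one output row per cid
    max_n = int(sizes.max())             # the largest N sets the width
    n_value_cols = df.shape[1] - 1       # every column except the key
    n_cols = max_n * n_value_cols
    return n_rows, n_cols, n_rows * n_cols


# Sample data from the table above; the date values are placeholder strings.
sample = pd.DataFrame({
    "cid": [1, 1, 1, 2, 2],
    "code": ["A", "B", "C", "A", "B"],
    "max": [32, 9, 25, 33, 11],
    "date": ["date1", "date2", "date3", "date4", "date5"],
})

rows, cols, cells = estimate_wide_cells(sample)
print(rows, cols, cells)
```

Running `df.groupby("cid").size().describe()` on the real data should show whether a few outlier cids are responsible for most of the width.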

I ran the following to index the DataFrame by "cid" and a per-cid row counter:

maxscoredf = maxscoredf.set_index(['cid', maxscoredf.groupby('cid').cumcount().add(1)])

When I then try to unstack, I get a memory error:

maxscoredf = maxscoredf.unstack(fill_value=-1)  # MemoryError: requires 221 GB of RAM
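For reference, here is a minimal, self-contained sketch of those two steps on the sample table above, where they run without trouble because the data is tiny. The names `position`, `indexed`, `wide`, and `value_cols` are introduced only for illustration, the date values are placeholder strings, and the text columns are cast to object on the assumption that this is needed for -1 to be accepted as a fill value in newer pandas versions.

```python
import pandas as pd

# Sample data from the table above; the date values are placeholder strings.
maxscoredf = pd.DataFrame({
    "cid": [1, 1, 1, 2, 2],
    "code": ["A", "B", "C", "A", "B"],
    "max": [32, 9, 25, 33, 11],
    "date": ["date1", "date2", "date3", "date4", "date5"],
})
# Keep text columns as object so that -1 can be stored in them (assumption).
maxscoredf = maxscoredf.astype({"code": object, "date": object})

# Number the rows within each cid (1, 2, 3, ...) and use that counter
# as a second index level next to cid.
position = maxscoredf.groupby("cid").cumcount().add(1)
indexed = maxscoredf.set_index(["cid", position])

# Pivot the counter level into columns; slots with no row become -1.
wide = indexed.unstack(fill_value=-1)

# Order the (column, position) pairs by position, then flatten them
# into names such as code1, max1, date1, code2, ...
value_cols = ["code", "max", "date"]
max_n = int(wide.columns.get_level_values(1).max())
ordered = [(col, pos) for pos in range(1, max_n + 1) for col in value_cols]
wide = wide[ordered]
wide.columns = ["{}{}".format(col, pos) for col, pos in wide.columns]

print(wide)
```

On the full data, it is this `unstack` call that fails.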

How do I circumvent this memory error? The goal is to get all values for the same cid into a single row, like so:

id  code1  mean1  count1  code2  mean2  count2  code3  mean3  count3
1   A      32     22      B      9      56      C      25     78
2   A      33     35      B      11     66      -1     -1     -1

Any missing values should be replaced by -1.

I am using the code from this answer: https://stackoverflow.com/a/66009708/6916919

Pandas version: 0.21. I am using this specific version because of https://stackoverflow.com/a/61757908/6916919

Please ask for any additional information that might be required.

Tanmay Bhatnagar
