
I have a Dataframe in Pandas like this:

            ID  rating     G1     G2     G3     G4  G5  G6  G7
0           1     2.5     18      0      0      0   0   0   0
1           4     4.0     18      0      0      0   0   0   0
2           7     3.0     78      1      0      0   0   0   0
3           1     4.0     21      7      8     10  30  40  20
4          21     3.0     18      0      0      0   0   0   0
5           7     2.0     18      80     10    11   8   0   0
6          41     3.5     18      0      9     10   0   0   0

and I would like to group all the elements by ID so as to obtain a condensed DataFrame in pandas with list-valued row entries like this:

   ID                    H1                               H2
0   1             [2.5, 18]  [4.0, 21, 7, 8, 10, 30, 40, 20]
1   4             [4.0, 18]                              NaN
2   7             [3.0, 78]         [2.0, 18, 80, 10, 11, 8]
3  21             [3.0, 18]                              NaN
4  41  [3.5, 18, 76, 9, 10]                              NaN

Do you know if it is possible? Thanks

user3043636
2 Answers


Use:

# reshape by stacking per ID, convert the Series to a one-column DataFrame
df = df.set_index('ID').stack().to_frame('s')
# mark the zero values
mask = df['s'].eq(0)
# helper column: running count of zeros, constant within each nonzero run
df['m'] = mask.groupby(level=0).cumsum()
# filter out the zero rows
df = df[~mask].reset_index()
# helper column for the new column names
df['g'] = df.groupby('ID')['m'].rank(method='dense').astype(int)
# create lists per group, reshape and add prefix
df = (df.groupby(['ID','g'])['s'].apply(list)
        .unstack()
        .add_prefix('H')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
   ID                H1                                             H2
0   1       [2.5, 18.0]  [4.0, 21.0, 7.0, 8.0, 10.0, 30.0, 40.0, 20.0]
1   4       [4.0, 18.0]                                            NaN
2   7  [3.0, 78.0, 1.0]             [2.0, 18.0, 80.0, 10.0, 11.0, 8.0]
3  21       [3.0, 18.0]                                            NaN
4  41       [3.5, 18.0]                                    [9.0, 10.0]
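The core of this answer is the cumsum-over-a-zero-mask trick: the running count of zeros stays constant within each run of nonzero values, so once the zeros are dropped it serves as a run identifier, and a dense rank renumbers the runs 1, 2, … per group. A minimal sketch on a toy series (values made up purely for illustration):

```python
import pandas as pd

# Toy series: two nonzero runs separated by zeros
s = pd.Series([2.5, 18, 0, 0, 4.0, 21, 7, 0])

mask = s.eq(0)
# cumsum of the zero-mask is constant inside each nonzero run
run_id = mask.cumsum()[~mask]
print(run_id.tolist())   # -> [0, 0, 2, 2, 2]

# dense rank renumbers the runs consecutively from 1
groups = run_id.rank(method='dense').astype(int)
print(groups.tolist())   # -> [1, 1, 2, 2, 2]
```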
jezrael

Sorry, got pulled into a meeting earlier; this is how I would have tackled it:

import numpy as np

# drop the zeros in each row and collect the remaining values into lists
df1 = df.groupby([df.index, "ID"]).agg(
    lambda x: x.replace(0, np.nan).dropna().tolist())
# summing lists across a row concatenates them into one list
df1['list_'] = df1.sum(axis=1)
print(df1['list_'])
0                          [2.5, 18]
1                          [4.0, 18]
2                       [3.0, 78, 1]
3    [4.0, 21, 7, 8, 10, 30, 40, 20]
4                          [3.0, 18]
5           [2.0, 18, 80, 10, 11, 8]
6                   [3.5, 18, 9, 10]
Name: list_, dtype: object

Then create our counter column with cumcount and pivot using crosstab.

# Create a row to use for columns
df1['count'] = 'H' + (df1.groupby('ID').cumcount() + 1).astype(str)

df1.reset_index(level=1,inplace=True)

final_ = pd.crosstab(df1["ID"], 
        df1["count"], 
        values=df1["list_"], 
        aggfunc="first").reset_index()

print(final_)


        ID                H1                              H2
0       1         [2.5, 18]  [4.0, 21, 7, 8, 10, 30, 40, 20]
1       4         [4.0, 18]                              NaN
2       7      [3.0, 78, 1]         [2.0, 18, 80, 10, 11, 8]
3      21         [3.0, 18]                              NaN
4      41  [3.5, 18, 9, 10]                              NaN
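The `pd.crosstab(..., values=..., aggfunc="first")` call works as a pivot even though the values are Python lists, because `"first"` simply picks the first object in each (row, column) group rather than doing arithmetic on it. A minimal sketch on toy data (not the asker's frame):

```python
import pandas as pd

# Toy long-format frame: one row per (ID, Hx) pair, list payloads
d = pd.DataFrame({
    'ID':    [1, 1, 4],
    'count': ['H1', 'H2', 'H1'],
    'list_': [[2.5, 18], [4.0, 21], [4.0, 18]],
})

# Pivot: rows = ID, columns = count, cells = the list for that pair
out = pd.crosstab(d['ID'], d['count'], values=d['list_'], aggfunc='first')
print(out)
```

Missing (ID, Hx) combinations come out as NaN, which matches the final table above.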

The only pain point I can see with this is that my list_ column has object dtype; I'm unsure whether you will do further operations with it later. Jezrael's solution will be more appropriate if so.

Edit for anyone else who comes across this :

using Python's .sum() is one of the slowest methods for list concatenation; if performance is a concern, please see: How to make a flat list out of list of lists
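For reference, a linear-time alternative to `.sum()` for flattening is `itertools.chain.from_iterable` (a generic sketch, not tied to the DataFrame above):

```python
from itertools import chain

# Flatten a list of lists in one pass instead of repeated list addition
lists = [[2.5, 18], [4.0, 21, 7, 8]]
flat = list(chain.from_iterable(lists))
print(flat)  # -> [2.5, 18, 4.0, 21, 7, 8]
```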

Umar.H
  • added in my answer, thanks as always Jozi, you have taught me so much on SO! – Umar.H Oct 15 '19 at 12:34
  • This solution, even if it provides the correct response, returns unordered 'Hx' columns. For instance, H1 H10 H8 H97 ... I tried to use final_ = final_.reindex(sorted(final_.columns), axis=1) but without any success – user3043636 Oct 15 '19 at 12:48
  • @user3043636 try : `(df1.groupby('ID').cumcount() + 1).astype(str).str.zfill(2)` if that doesn't work you need to order the columns alpha numerically [order columns alpha numerically](https://stackoverflow.com/questions/56663902/re-ordering-df-columns-alpha-numerically) which is ironically answered by Jezrael. – Umar.H Oct 15 '19 at 12:51