
I have a Dataframe in Pandas like this:

            ID  rating     G1     G2     G3     G4  G5  G6  G7
0           1     2.5     18      0      0      0   0   0   0
1           4     4.0     18      0      0      0   0   0   0
2           7     3.0     78      1      0      0   0   0   0
3           1     4.0     21      7      8     10  30  40  20
4          21     3.0     18      0      0      0   0   0   0
5           7     2.0     18      80     10    11   8   0   0
6          41     3.5     18      0      9     10   0   0   0

and I would like to group all the elements by ID so as to obtain a condensed DataFrame in pandas with list-valued row entries like this:

   ID                    H1                               H2
0   1             [2.5, 18]  [4.0, 21, 7, 8, 10, 30, 40, 20]
1   4             [4.0, 18]                              NaN
2   7             [3.0, 78]         [2.0, 18, 80, 10, 11, 8]
3  21             [3.0, 18]                              NaN
4  41  [3.5, 18, 76, 9, 10]                              NaN

Do you know if it is possible? Thanks

user3043636
2 Answers


Use:

# reshape by stacking per ID, convert the Series to a one-column DataFrame
df = df.set_index('ID').stack().to_frame('s')
# mark the zero values
mask = df['s'].eq(0)
# helper column: running count of zeros, constant within each nonzero run
df['m'] = mask.groupby(level=0).cumsum()
# filter out the zero rows
df = df[~mask].reset_index()
# helper column for the new column names
df['g'] = df.groupby('ID')['m'].rank(method='dense').astype(int)
# create lists per group, reshape and add prefix
df = (df.groupby(['ID','g'])['s'].apply(list)
        .unstack()
        .add_prefix('H')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
   ID                H1                                             H2
0   1       [2.5, 18.0]  [4.0, 21.0, 7.0, 8.0, 10.0, 30.0, 40.0, 20.0]
1   4       [4.0, 18.0]                                            NaN
2   7  [3.0, 78.0, 1.0]             [2.0, 18.0, 80.0, 10.0, 11.0, 8.0]
3  21       [3.0, 18.0]                                            NaN
4  41       [3.5, 18.0]                                    [9.0, 10.0]
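The core of this answer is the cumsum-over-a-zero-mask trick: the running count of zeros stays constant within each run of nonzero values, so once the zeros are dropped it serves as a run identifier, and a dense rank renumbers the runs 1, 2, … per group. A minimal sketch on a toy series (values made up purely for illustration):

```python
import pandas as pd

# Toy series: two nonzero runs separated by zeros
s = pd.Series([2.5, 18, 0, 0, 4.0, 21, 7, 0])

mask = s.eq(0)
# cumsum of the zero-mask is constant inside each nonzero run
run_id = mask.cumsum()[~mask]
print(run_id.tolist())   # -> [0, 0, 2, 2, 2]

# dense rank renumbers the runs consecutively from 1
groups = run_id.rank(method='dense').astype(int)
print(groups.tolist())   # -> [1, 1, 2, 2, 2]
```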
jezrael

Sorry, got pulled into a meeting earlier; this is how I would have tackled it:

import numpy as np

# drop the zeros in each row and collect the remaining values into lists
df1 = df.groupby([df.index, "ID"]).agg(
    lambda x: x.replace(0, np.nan).dropna().tolist())
# summing lists across a row concatenates them into one list
df1['list_'] = df1.sum(axis=1)
print(df1['list_'])
0                          [2.5, 18]
1                          [4.0, 18]
2                       [3.0, 78, 1]
3    [4.0, 21, 7, 8, 10, 30, 40, 20]
4                          [3.0, 18]
5           [2.0, 18, 80, 10, 11, 8]
6                   [3.5, 18, 9, 10]
Name: list_, dtype: object

Then create our counter column with cumcount and pivot using crosstab.

# Create a row to use for columns
df1['count'] = 'H' + (df1.groupby('ID').cumcount() + 1).astype(str)

df1.reset_index(level=1,inplace=True)

final_ = pd.crosstab(df1["ID"], 
        df1["count"], 
        values=df1["list_"], 
        aggfunc="first").reset_index()

print(final_)


        ID                H1                              H2
0       1         [2.5, 18]  [4.0, 21, 7, 8, 10, 30, 40, 20]
1       4         [4.0, 18]                              NaN
2       7      [3.0, 78, 1]         [2.0, 18, 80, 10, 11, 8]
3      21         [3.0, 18]                              NaN
4      41  [3.5, 18, 9, 10]                              NaN
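The `pd.crosstab(..., values=..., aggfunc="first")` call works as a pivot even though the values are Python lists, because `"first"` simply picks the first object in each (row, column) group rather than doing arithmetic on it. A minimal sketch on toy data (not the asker's frame):

```python
import pandas as pd

# Toy long-format frame: one row per (ID, Hx) pair, list payloads
d = pd.DataFrame({
    'ID':    [1, 1, 4],
    'count': ['H1', 'H2', 'H1'],
    'list_': [[2.5, 18], [4.0, 21], [4.0, 18]],
})

# Pivot: rows = ID, columns = count, cells = the list for that pair
out = pd.crosstab(d['ID'], d['count'], values=d['list_'], aggfunc='first')
print(out)
```

Missing (ID, Hx) combinations come out as NaN, which matches the final table above.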

The only pain point I can see with this is that my list_ column has object dtype; I'm unsure whether you will do further operations with it later. Jezrael's solution will be more appropriate if so.

Edit for anyone else who comes across this :

using Python's .sum() is one of the slowest methods for list concatenation; if performance is a concern, please see: How to make a flat list out of list of lists
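For reference, a linear-time alternative to `.sum()` for flattening is `itertools.chain.from_iterable` (a generic sketch, not tied to the DataFrame above):

```python
from itertools import chain

# Flatten a list of lists in one pass instead of repeated list addition
lists = [[2.5, 18], [4.0, 21, 7, 8]]
flat = list(chain.from_iterable(lists))
print(flat)  # -> [2.5, 18, 4.0, 21, 7, 8]
```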

Umar.H
  • added in my answer, thanks as always Jozi, you have taught me so much on SO! – Umar.H Oct 15 '19 at 12:34
  • This solution, even if it provides the correct response, returns unordered 'Hx' columns. For instance, H1 H10 H8 H97 ... I tried to use final_ = final_.reindex(sorted(final_.columns), axis=1) but without any success – user3043636 Oct 15 '19 at 12:48
  • @user3043636 try : `(df1.groupby('ID').cumcount() + 1).astype(str).str.zfill(2)` if that doesn't work you need to order the columns alpha numerically [order columns alpha numerically](https://stackoverflow.com/questions/56663902/re-ordering-df-columns-alpha-numerically) which is ironically answered by Jezrael. – Umar.H Oct 15 '19 at 12:51