8

I have a Pandas dataframe with the following structure:

A       B       C
a       b       1
a       b       2
a       b       3
c       d       7
c       d       8
c       d       5
c       d       6
c       d       3
e       b       4
e       b       3
e       b       2
e       b       1

And I will like to transform it into this:

A       B       C1      C2      C3      C4      C5
a       b       1       2       3       NAN     NAN
c       d       7       8       5       6       3
e       b       4       3       2       1       NAN

In other words, something like groupby A and B and expand C into different columns.

Knowing that the length of each group is different.

C is already ordered

Shorter groups can have NAN or NULL values (empty), it does not matter.

wjandrea
  • 28,235
  • 9
  • 60
  • 81
mirix
  • 511
  • 1
  • 5
  • 13
  • Don't put a solution in the question. I've removed it for you. Normally, you'd [post your own answer](/help/self-answer) instead, but where this change is so minor, it'd be better to [suggest an edit on the answer](/posts/74401567/edit). – wjandrea Nov 11 '22 at 22:31
  • @wjandrea, in the future, please, refrain from erasing the most correct solution. You are wasting everybody's time, including yours. – mirix Nov 12 '22 at 19:14
  • IMO, it's fundamentally the same as the accepted solution, so it doesn't require its own answer. (And I assume anyone getting the warning could solve it themselves.) In any case, it's important to separate questions and answers because of the format of this site: problems up top, solutions below. Others have said the same thing, like here's [a mod's take on a vaguely similar situation](https://meta.stackoverflow.com/a/267437/4518341). – wjandrea Nov 12 '22 at 19:57
  • Right now, it is essentially the same. But the deprecation warning means that in the future the accepted solution will not work anymore. Therefore, you are making future coders wasting their time by erasing the correct solution. – mirix Nov 13 '22 at 21:01
  • It'll still be fundamentally the same since it's only a minor part that changes. We also have answers on this site written for Python 2 that still use `print` as a statement instead of a function and all you need to do to make them work in Python 3 is add parentheses. Although, editing them to add the parentheses is perfectly fine, and that's just like what I'm suggesting here. – wjandrea Nov 13 '22 at 21:23

3 Answers3

12

Use GroupBy.cumcount and pandas.Series.add with 1, to start naming the new columns from 1 onwards, then pass this to DataFrame.pivot, and add DataFrame.add_prefix to rename the columns (C1, C2, C3, etc...). Finally use DataFrame.rename_axis to remove the indexes original name ('g') and transform the MultiIndex into columns by using DataFrame.reset_indexcolumns A,B:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
   A  B   C1   C2   C3   C4   C5
0  a  b  1.0  2.0  3.0  NaN  NaN
1  c  d  7.0  8.0  5.0  6.0  3.0
2  e  b  4.0  3.0  2.0  1.0  NaN

Because NaN is by default of type float, if you need the columns dtype to be integers add DataFrame.astype with Int64:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = (df.pivot(['A','B'], 'g', 'C')
        .add_prefix('C')
        .astype('Int64')
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B  C1  C2  C3    C4    C5
0  a  b   1   2   3  <NA>  <NA>
1  c  d   7   8   5     6     3
2  e  b   4   3   2     1  <NA>

EDIT: If there's a maximum N new columns to be added, it means that A,B are duplicated. Therefore, it will beneeded to add helper groups g1, g2 with integer and modulo division, adding a new level in index:

N = 4
g  = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
        .add_prefix('C')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B   C1   C2   C3   C4
0  a  b  1.0  2.0  3.0  NaN
1  c  d  7.0  8.0  5.0  6.0
2  c  d  3.0  NaN  NaN  NaN
3  e  b  4.0  3.0  2.0  1.0 
Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    can you explain a bit more the idea behind this code? – seralouk Nov 11 '22 at 11:08
  • Thanks! It almost does the trick. However, instead of producing 10 C columns (10 is the maximum length of a group), it produces a few dozens (C_1...C_3000). The numerical indexes are not correlative. – mirix Nov 11 '22 at 11:12
  • @jezrael Do you need `.rename_axis(columns=None)` ? I removed it and I think it works fine. Let me know because I would like to edit the answer slightly. It is brilliant. – Celius Stingher Nov 11 '22 at 11:17
  • @mirix - added solution to `EDIT` – jezrael Nov 11 '22 at 11:18
  • @CeliusStingher - then in ouput is `g` for first solution and `g2` for EDIT solution if remove `.rename_axis(columns=None)` – jezrael Nov 11 '22 at 11:18
  • 1
    Ok, but also can be added after `reset_index()` Thanks. I will edit the answer shortly if you don't mind – Celius Stingher Nov 11 '22 at 11:22
  • 1
    It works! Thanks a million! Just a little deprecation warning: FutureWarning: In a future version of pandas all arguments of DataFrame.pivot will be keyword-only. – mirix Nov 11 '22 at 11:25
0
df1.astype({'C':str}).groupby([*'AB'])\
    .agg(','.join).C.str.split(',',expand=True)\
    .add_prefix('C').reset_index()


 A  B C0 C1 C2    C3    C4
0  a  b  1  2  3  None  None
1  c  d  7  8  5     6     3
2  e  b  4  3  2     1  None
G.G
  • 639
  • 1
  • 5
0

The accepted solution but avoiding the deprecation warning:

N = 3
g  = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
        .add_prefix('C_')
        .astype('Int64')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
wjandrea
  • 28,235
  • 9
  • 60
  • 81
mirix
  • 511
  • 1
  • 5
  • 13