2

This question builds on Pandas melt several groups of columns into multiple target columns by name. However, my data is organised in an annyoing way

import pandas as pd

df = pd.DataFrame([(101, 'a', 'b', 'c', 'd', 'e', 'f', 1, 2, 3, 4, 5, 6, 'aa', 'bb', 'cc', 'dd', 'ee', 'ff'),
                   (102,'g', 'h', 'i', 'j', 'k', 'l' , 7, 8, 9, 10, 11, 12, 'gg', 'hh', 'ii', 'jj', 'kk', 'll')], 
                   columns=['id','a__1', 'a__2', 'a__3', 'a_1', 'a_2', 'a_3','b__1', 'b__2', 'b__3', 'b_1', 'b_2', 'b_3','c__1', 'c__2', 'c__3', 'c_1', 'c_2', 'c_3'])

df

And the dataframe looks like this:

    id a__1 a__2 a__3 a_1 a_2 a_3  b__1  b__2  b__3  b_1  b_2  b_3 c__1 c__2  \
0  101    a    b    c   d   e   f     1     2     3    4    5    6   aa   bb   
1  102    g    h    i   j   k   l     7     8     9   10   11   12   gg   hh   

  c__3 c_1 c_2 c_3  
0   cc  dd  ee  ff  
1   ii  jj  kk  ll 

As you can see instead of all the seperators only being one underscore they are sometimes two underscores.

In the end I would like to organise my data to be in the following way, note that the a__1 should come before a_1 and so on:

     id   a   b   c
0   101   a   1   aa
1   101   b   2   bb
2   101   c   3   cc
3   101   d   4   dd
4   101   e   5   ee
5   101   f   6   ff   
6   102   g   7   gg
7   102   h   8   hh
8   102   i   9   ii
9   102   j   10  jj
10  102   k   11  kk
11  102   l   12  ll

Basically my problem is that instead of only one underscore I sometimes have 2. Now this doesn't work with the pd.wide_to_long function since I cannot set the separator argument as '_' or '__'. I hope you understand my question.

cs95
  • 379,657
  • 97
  • 704
  • 746
Elias K.
  • 513
  • 1
  • 4
  • 12

2 Answers2

3

You could use @Wen old favorite pd.wide_to_long:

pd.wide_to_long(df,['a','b','c'],i='id',j='new',sep='_',suffix='.')\
  .reset_index('new', drop=True)\
  .sort_index()

Output:

     a   b   c
id            
101  a   1  aa
101  b   2  bb
101  c   3  cc
101  d   4  dd
101  e   5  ee
101  f   6  ff
102  g   7  gg
102  h   8  hh
102  i   9  ii
102  j  10  jj
102  k  11  kk
102  l  12  ll
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Could I ask what the difference between this pd.wide_to_long is and the one in the question I linked? – Elias K. Apr 30 '18 at 15:54
  • 1
    There is no difference, just don't see why you would use pd.wide_to_long. The "New" column or "dropme" column isn't used anyway, so the single _ vs __ is an issue. Just use a single _ as the seperator and you are fine. – Scott Boston Apr 30 '18 at 15:57
  • One more question, what if I also had a column called b for example that has a value between b__3 and b_1 so something like 3.5. How would i change your method since now one of the column names is the same as the stub? – Elias K. Apr 30 '18 at 16:32
  • 1
    @EliasK. At that point, I would rename the columns to standardize then use the common seperator. IE. df.coumns = df.columns.str.replace('.','_') or something similiar that works as a separator. – Scott Boston Apr 30 '18 at 17:16
2

You can using repalce reaplce all '_' and '__' to '', then we are using stack to reshape your df, after using cumcount create the unique key we can achieve what you need

df.columns=df.columns.str.replace('_','')
s=df.set_index('id').stack().reset_index(level=1)
s.level_1=s.level_1.str[0]
s['New']=s.groupby('level_1').cumcount()



s.set_index(['New','level_1'],append=True)[0].unstack()


Out[509]: 
level_1  a   b   c
id  New           
101 0    a   1  aa
    1    b   2  bb
    2    c   3  cc
    3    d   4  dd
    4    e   5  ee
    5    f   6  ff
102 6    g   7  gg
    7    h   8  hh
    8    i   9  ii
    9    j  10  jj
    10   k  11  kk
    11   l  12  ll
BENY
  • 317,841
  • 20
  • 164
  • 234