11

How can I melt a pandas data frame using multiple variable names and values? I have the following data frame that changes its shape in a for loop. In one of the for loop iterations, it looks like this:

ID  Cat    Class_A   Class_B     Prob_A     Prob_B
1   Veg      1        2          0.9         0.1
2   Veg      1        2          0.8         0.2
3   Meat     1        2          0.6         0.4
4   Meat     1        2          0.3         0.7
5   Veg      1        2          0.2         0.8

I need to melt it in such a way that it looks like this:

ID  Cat    Class     Prob    
1   Veg      1       0.9       
1   Veg      2       0.1
2   Veg      1       0.8        
2   Veg      2       0.2
3   Meat     1       0.6         
3   Meat     2       0.4
4   Meat     1       0.3         
4   Meat     2       0.7
5   Veg      1       0.2         
5   Veg      2       0.8

During the for loop the data frame will contain different number of classes with their probabilities. That is why I am looking for a general approach that is applicable in all my for loop iterations. I saw this question and this but they were not helpful!

vestland
  • 55,229
  • 37
  • 187
  • 305
owise
  • 1,055
  • 16
  • 28

5 Answers5

13

You need lreshape by dict for specify categories:

d = {'Class':['Class_A', 'Class_B'], 'Prob':['Prob_A','Prob_B']}
df = pd.lreshape(df,d)
print (df)
    Cat  ID  Class  Prob
0   Veg   1      1   0.9
1   Veg   2      1   0.8
2  Meat   3      1   0.6
3  Meat   4      1   0.3
4   Veg   5      1   0.2
5   Veg   1      2   0.1
6   Veg   2      2   0.2
7  Meat   3      2   0.4
8  Meat   4      2   0.7
9   Veg   5      2   0.8

More dynamic solution:

Class = [col for col in df.columns if col.startswith('Class')]
Prob = [col for col in df.columns if col.startswith('Prob')]
df = pd.lreshape(df, {'Class':Class, 'Prob':Prob})
print (df)
    Cat  ID  Class  Prob
0   Veg   1      1   0.9
1   Veg   2      1   0.8
2  Meat   3      1   0.6
3  Meat   4      1   0.3
4   Veg   5      1   0.2
5   Veg   1      2   0.1
6   Veg   2      2   0.2
7  Meat   3      2   0.4
8  Meat   4      2   0.7
9   Veg   5      2   0.8

EDIT:

lreshape is now undocumented, but is possible in future will by removed (with pd.wide_to_long too).

Possible solution is merging all 3 functions to one - maybe melt, but now it is not implementated. Maybe in some new version of pandas. Then my answer will be updated.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @jezrael: what about if I need to have under Class the names of the classes (''Class_A', 'Class_B') rather than thier codes ('1', '2')? –  owise Aug 13 '17 at 17:28
  • I think the simpliest is `df['Class_A'] = 'Class_A'` and `df['Class_B'] = 'Class_B'` - replace values by columns with scalar and then use `lreshape` – jezrael Aug 13 '17 at 17:47
  • @jezrael I don't think undocumented functions are good to use, seeing that you had to link directly to the source code. There has been talk of removing `lreshape` – Ted Petrou Aug 18 '17 at 18:08
  • @Ted Petrou - Thank you for comment. Yes, is possible sometimes in future function can be removed or not, both function are used rarely. But I think it is not reason for downvote :( – jezrael Aug 18 '17 at 19:21
  • I think this answer is really bad which is why I downvoted it. You still have 10 upvotes. In my opinion, you should almost never use undocumented features, especially ones that have an equivalent and superior counterpart like `pd.wide_to_long`. Now people are going to come to this question and think this is a proper way to melt two columns at the same time. There is no way anyone should use this in production. – Ted Petrou Aug 18 '17 at 19:32
  • Hmmm, it is really bad because undocumented function? In my opinion not. And it is really bad use it? Hmmm, interesting. I dont agree. I think undocumented == really bad is very strong decision and if you think it no problem. I agree your function is another possible solution, but in my opinion... – jezrael Aug 18 '17 at 19:47
  • ...in future are 2 possible ways - function will be removed or not. It depends of devs of pandas. Good luck and thank you for your opinion, although I disagree, becasue I cannot see future. :) – jezrael Aug 18 '17 at 19:49
  • You write like you have never written any production code in your life. Yes, functions are undocumented for a reason. And no its not necessarily bad but in this case it is. This function has been in existence since at least 2012. Why has it been undocumented for so many years? There's a perfectly good alternative that is documented and you should use that one. Maybe @jeff can chime in. I asked him on [github](https://github.com/pandas-dev/pandas/issues/15003) – Ted Petrou Aug 18 '17 at 20:01
4

Or you can try this by using str.contain and pd.concat

DF1=df2.loc[:,df2.columns.str.contains('_A|Cat|ID')]
name=['ID','Cat','Class','Prob']
DF1.columns=name
DF2=df2.loc[:,df2.columns.str.contains('_B|Cat|ID')]
DF2.columns=name
pd.concat([DF1,DF2],axis=0)

Out[354]: 
   ID   Cat  Class  Prob
0   1   Veg      1   0.9
1   2   Veg      1   0.8
2   3  Meat      1   0.6
3   4  Meat      1   0.3
4   5   Veg      1   0.2
0   1   Veg      2   0.1
1   2   Veg      2   0.2
2   3  Meat      2   0.4
3   4  Meat      2   0.7
4   5   Veg      2   0.8
BENY
  • 317,841
  • 20
  • 164
  • 234
0

The top voted answer uses the undocumented lreshape which may at some point get deprecated because of its similarity to pd.wide_to_long which is documented and can use directly here. By default suffix matches only to numbers. You must change this to match characters (here I just used any character).

pd.wide_to_long(df, stubnames=['Class', 'Prob'], i=['ID', 'Cat'], j='DROPME', suffix='.')\
  .reset_index()\
  .drop('DROPME', axis=1)

   ID   Cat  Class  Prob
0   1   Veg      1   0.9
1   1   Veg      2   0.1
2   2   Veg      1   0.8
3   2   Veg      2   0.2
4   3  Meat      1   0.6
5   3  Meat      2   0.4
6   4  Meat      1   0.3
7   4  Meat      2   0.7
8   5   Veg      1   0.2
9   5   Veg      2   0.8
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
0

You could also use pd.melt.

# Make DataFrame
df = pd.DataFrame({'ID' :  [i for i in range(1,6)],
                   'Cat' : ['Veg']*2 + ['Meat']*2 + ['Veg'],
                   'Class_A' : [1]*5,
                   'Class_B' : [2]*5,
                   'Prob_A' : [0.9, 0.8, 0.6, 0.3, 0.2],
                   'Prob_B' : [0.1, 0.2, 0.4, 0.7, 0.8]})

# Make class dataframe and prob dataframe
df_class = df.loc[:, ['ID', 'Cat', 'Class_A', 'Class_B']]
df_prob = df.loc[:, ['ID', 'Cat', 'Prob_A', 'Prob_B']]

# Melt class dataframe and prob dataframe
df_class = df_class.melt(id_vars = ['ID',
                                    'Cat'],
                         value_vars = ['Class_A',
                                       'Class_B'],
                         value_name = 'Class')
df_prob = df_prob.melt(id_vars = ['ID',
                                  'Cat'],
                       value_vars = ['Prob_A',
                                     'Prob_B'],
                       value_name = 'Prob')

# Clean variable column so only 'A','B' is left in both dataframes
df_class.loc[:, 'variable'] = df_class.loc[:, 'variable'].str.partition('_')[2]
df_prob.loc[:, 'variable'] = df_prob.loc[:, 'variable'].str.partition('_')[2]

# Merge class dataframe with prob dataframe on 'ID', 'Cat', and 'variable';
# drop 'variable'; sort values by 'ID', 'Cat'
final = df_class.merge(df_prob,
                       how = 'inner',
                       on = ['ID',
                             'Cat',
                             'variable']).drop('variable', axis = 1).sort_values(by = ['ID',
                                                                                       'Cat'])
Ian Thompson
  • 2,914
  • 2
  • 18
  • 31
0

One option is pivot_longer from pyjanitor, which abstracts the process, and is efficient:

# pip install janitor
import janitor

df.pivot_longer(
    index = ['ID', 'Cat'], 
    names_to = '.value', 
    names_pattern = '([a-zA-Z]+)_*')

   ID   Cat  Class  Prob
0   1   Veg      1   0.9
1   2   Veg      1   0.8
2   3  Meat      1   0.6
3   4  Meat      1   0.3
4   5   Veg      1   0.2
5   1   Veg      2   0.1
6   2   Veg      2   0.2
7   3  Meat      2   0.4
8   4  Meat      2   0.7
9   5   Veg      2   0.8

The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31