As described by U12-Forward, melting a dataframe primarily means reshaping the data from wide form to long form. More often than not, the new dataframe will have more rows and fewer columns compared to the original dataframe.
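As a minimal sketch of the idea, using pd.melt on a small hypothetical dataframe (the id/year/score names here are purely for illustration):

```python
import pandas as pd

# hypothetical wide data: one row per subject, one column per year
wide = pd.DataFrame({'id': [1, 2], '2020': [10, 30], '2021': [20, 40]})

# wide to long: 2 rows x 3 columns becomes 4 rows x 3 columns
long_form = wide.melt(id_vars='id', var_name='year', value_name='score')
print(long_form)
#    id  year  score
# 0   1  2020     10
# 1   2  2020     30
# 2   1  2021     20
# 3   2  2021     40
```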
There are different scenarios when it comes to melting: all column labels could be melted into a single column, or multiple columns; some parts of the column labels could be retained as headers, while the rest are collated into a column, and so on. This answer shows how to melt a pandas dataframe, using pd.stack, pd.melt, pd.wide_to_long, and pivot_longer from pyjanitor (I am a contributor to the pyjanitor library). The examples won't be exhaustive, but hopefully should point you in the right direction when it comes to reshaping dataframes from wide to long form.
Sample Data
import pandas as pd

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Scenario 1 - Melt all columns:
In this case, we wish to convert all the specified column headers into rows - this can be done with pd.melt or pd.stack, and the solutions to problem 1 already cover this. The reshaping can also be done with pivot_longer:
# pip install pyjanitor
import janitor
df.pivot_longer(index = 'Species')
Species variable value
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
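For comparison, the same reshape could be sketched with plain pandas, via pd.melt or pd.stack (both mentioned above); the stacked version needs a little index housekeeping to end up with the same flat shape:

```python
import pandas as pd

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
)

# via pd.melt - everything except Species is melted into rows
melted = df.melt(id_vars='Species')

# via pd.stack - stack the columns into the index, then flatten
stacked = (df.set_index('Species')
             .stack()
             .rename('value')
             .reset_index()
             .rename(columns={'level_1': 'variable'}))
```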
Just like in pd.melt, you can rename the variable and value columns by passing arguments to the names_to and values_to parameters:
df.pivot_longer(index = 'Species',
                names_to = 'dimension',
                values_to = 'measurement_in_cm')
Species dimension measurement_in_cm
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
You can also retain the original index, and sort the output by order of appearance:
df.pivot_longer(index = 'Species',
                names_to = 'dimension',
                values_to = 'measurement_in_cm',
                ignore_index = False,
                sort_by_appearance = True)
Species dimension measurement_in_cm
0 setosa Sepal.Length 5.1
0 setosa Sepal.Width 3.5
0 setosa Petal.Length 1.4
0 setosa Petal.Width 0.2
1 virginica Sepal.Length 5.9
1 virginica Sepal.Width 3.0
1 virginica Petal.Length 5.1
1 virginica Petal.Width 1.8
By default, the values in names_to are strings; they can be converted to other data types via the names_transform parameter - this can be helpful/performant for large dataframes, as it is generally more efficient compared to converting the data types after the reshaping.
out = df.pivot_longer(index = 'Species',
                      names_to = 'dimension',
                      values_to = 'measurement_in_cm',
                      ignore_index = False,
                      sort_by_appearance = True,
                      names_transform = 'category')
out.dtypes
Species object
dimension category
measurement_in_cm float64
dtype: object
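For comparison, a sketch of the same result with plain pd.melt, where the dtype conversion has to happen after the reshape, on rows that have already been expanded:

```python
import pandas as pd

df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
)

out = df.melt(id_vars='Species',
              var_name='dimension',
              value_name='measurement_in_cm')
# the category conversion happens after the reshape,
# on the already-expanded rows
out['dimension'] = out['dimension'].astype('category')
```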
Scenario 2 - Melt column labels into multiple columns:
So far, we've melted our data into single columns, one for the column names and one for the values. However, there might be scenarios where we wish to split the column labels into different columns, or even the values into different columns. Continuing with our sample data, we could prefer to have Sepal and Petal under a part column, while Length and Width go into a dimension column:
Via pd.melt - The separation is done after the melt:
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
 .assign(part = arr.str[0],
         dimension = arr.str[1])
 .drop(columns = 'variable')
)
Species value part dimension
0 setosa 5.1 Sepal Length
1 virginica 5.9 Sepal Length
2 setosa 3.5 Sepal Width
3 virginica 3.0 Sepal Width
4 setosa 1.4 Petal Length
5 virginica 5.1 Petal Length
6 setosa 0.2 Petal Width
7 virginica 1.8 Petal Width
Via pd.stack - offers a more efficient way of splitting the columns; the split is done on the columns, meaning fewer rows to deal with, and a potentially faster outcome as the data size increases:
out = df.set_index('Species')
# This returns a MultiIndex
out.columns = out.columns.str.split('.', expand = True)
new_names = ['part', 'dimension']
out.columns.names = new_names
out.stack(new_names).rename('value').reset_index()
Species part dimension value
0 setosa Petal Length 1.4
1 setosa Petal Width 0.2
2 setosa Sepal Length 5.1
3 setosa Sepal Width 3.5
4 virginica Petal Length 5.1
5 virginica Petal Width 1.8
6 virginica Sepal Length 5.9
7 virginica Sepal Width 3.0
Via pivot_longer - The key thing to note about pivot_longer is that it looks for patterns. Here, the column labels are separated by a dot (.). Simply pass a list/tuple of new names to names_to, and pass a separator to names_sep (under the hood it just uses pd.Series.str.split):
df.pivot_longer(index = 'Species',
                names_to = ('part', 'dimension'),
                names_sep = '.')
Species part dimension value
0 setosa Sepal Length 5.1
1 virginica Sepal Length 5.9
2 setosa Sepal Width 3.5
3 virginica Sepal Width 3.0
4 setosa Petal Length 1.4
5 virginica Petal Length 5.1
6 setosa Petal Width 0.2
7 virginica Petal Width 1.8
So far, we've seen how melt, stack and pivot_longer can split the column labels into multiple new columns, as long as there is a defined separator. What if there isn't a clearly defined separator, like in the dataframe below?
# https://github.com/tidyverse/tidyr/blob/main/data-raw/who.csv
who = pd.DataFrame({'id': [1], 'new_sp_m5564': [2], 'newrel_f65': [3]})
who
id new_sp_m5564 newrel_f65
0 1 2 3
In the second column, we have multiple _, compared to the third column which has just one _. The goal here is to split the column labels into individual columns (sp & rel to a diagnosis column, m & f to a gender column, the numbers to an age column). One option is to extract the column sub-labels via a regex.
Via pd.melt - again with pd.melt, the reshaping occurs after the melt:
out = who.melt('id')
regex = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)"
new_df = out.variable.str.extract(regex)
# pd.concat can be used here instead
out.drop(columns='variable').assign(**new_df)
id value diagnosis gender age
0 1 2 sp m 5564
1 1 3 rel f 65
Note how the extraction relies on the regex's named groups (the parts in parentheses).
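As the comment in the snippet above notes, pd.concat can be used instead of assign; a sketch of that variant:

```python
import pandas as pd

who = pd.DataFrame({'id': [1], 'new_sp_m5564': [2], 'newrel_f65': [3]})
out = who.melt('id')
regex = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)"
new_df = out.variable.str.extract(regex)
# concatenate the extracted columns side by side with the melted frame
result = pd.concat([out.drop(columns='variable'), new_df], axis=1)
```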
Via pd.stack - just like in the previous example, the split is done on the columns, offering more in terms of efficiency:
out = who.set_index('id')
regex = r"new_?(.+)_(.)(\d+)"
new_names = ['diagnosis', 'gender', 'age']
# Returns a dataframe
new_cols = out.columns.str.extract(regex)
new_cols.columns = new_names
new_cols = pd.MultiIndex.from_frame(new_cols)
out.columns = new_cols
out.stack(new_names).rename('value').reset_index()
id diagnosis gender age value
0 1 rel f 65 3.0
1 1 sp m 5564 2.0
Again, the extraction relies on the regex groups.
Via pivot_longer - again we know the pattern and the new column names; we simply pass those to the function, this time via names_pattern, since we are dealing with a regex. The extracts will match the regular expression's groups (the parts in parentheses):
regex = r"new_?(.+)_(.)(\d+)"
new_names = ['diagnosis', 'gender', 'age']
who.pivot_longer(index = 'id',
                 names_to = new_names,
                 names_pattern = regex)
id diagnosis gender age value
0 1 sp m 5564 2
1 1 rel f 65 3
Scenario 3 - Melt column labels and values into multiple columns:
What if we wish to split the values into multiple columns as well? Let's use a fairly popular question on SO:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
                   'State': ['Texas', 'Texas', 'Alabama'],
                   'Name': ['Aria', 'Penelope', 'Niko'],
                   'Mango': [4, 10, 90],
                   'Orange': [10, 8, 14],
                   'Watermelon': [40, 99, 43],
                   'Gin': [16, 200, 34],
                   'Vodka': [20, 33, 18]},
                  columns=['City', 'State', 'Name', 'Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka'])
df
City State Name Mango Orange Watermelon Gin Vodka
0 Houston Texas Aria 4 10 40 16 20
1 Austin Texas Penelope 10 8 99 200 33
2 Hoover Alabama Niko 90 14 43 34 18
The goal is to collate Mango, Orange, and Watermelon into a Fruit column, Gin and Vodka into a Drink column, and collate the respective values into Pounds and Ounces columns respectively.
Via pd.melt - I am copying the excellent solution verbatim:
df1 = df.melt(id_vars=['City', 'State'],
              value_vars=['Mango', 'Orange', 'Watermelon'],
              var_name='Fruit', value_name='Pounds')
df2 = df.melt(id_vars=['City', 'State'],
              value_vars=['Gin', 'Vodka'],
              var_name='Drink', value_name='Ounces')
df1 = df1.set_index(['City', 'State', df1.groupby(['City', 'State']).cumcount()])
df2 = df2.set_index(['City', 'State', df2.groupby(['City', 'State']).cumcount()])
df3 = (pd.concat([df1, df2], axis=1)
         .sort_index(level=2)
         .reset_index(level=2, drop=True)
         .reset_index())
print(df3)
City State Fruit Pounds Drink Ounces
0 Austin Texas Mango 10 Gin 200.0
1 Hoover Alabama Mango 90 Gin 34.0
2 Houston Texas Mango 4 Gin 16.0
3 Austin Texas Orange 8 Vodka 33.0
4 Hoover Alabama Orange 14 Vodka 18.0
5 Houston Texas Orange 10 Vodka 20.0
6 Austin Texas Watermelon 99 NaN NaN
7 Hoover Alabama Watermelon 43 NaN NaN
8 Houston Texas Watermelon 40 NaN NaN
Via pd.stack - I can't think of a solution via stack, so I'll skip it.
Via pivot_longer - The reshape can be efficiently done by passing lists of names to names_to and values_to, along with a list of regular expressions to names_pattern - when splitting values into multiple columns, a list of regexes to names_pattern is required:
df.pivot_longer(
    index=["City", "State"],
    column_names=slice("Mango", "Vodka"),
    names_to=("Fruit", "Drink"),
    values_to=("Pounds", "Ounces"),
    names_pattern=[r"M|O|W", r"G|V"],
)
City State Fruit Pounds Drink Ounces
0 Houston Texas Mango 4 Gin 16.0
1 Austin Texas Mango 10 Gin 200.0
2 Hoover Alabama Mango 90 Gin 34.0
3 Houston Texas Orange 10 Vodka 20.0
4 Austin Texas Orange 8 Vodka 33.0
5 Hoover Alabama Orange 14 Vodka 18.0
6 Houston Texas Watermelon 40 None NaN
7 Austin Texas Watermelon 99 None NaN
8 Hoover Alabama Watermelon 43 None NaN
The efficiency gain becomes more apparent as the dataframe size increases.
Scenario 4 - Group similar columns together:
Extending the concept of melting into multiple columns, let's say we wish to group similar columns together. We do not care about retaining the column labels, just combining the values of similar columns into new columns.
df = pd.DataFrame({'x_1_mean': [10],
                   'x_2_mean': [20],
                   'y_1_mean': [30],
                   'y_2_mean': [40],
                   'unit': [50]})
df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
For the code above, we wish to combine similar columns (columns that start with the same letter) into new unique columns - all x* columns will be lumped under xmean, while all y* columns will be collated under ymean. We are not saving the column labels; we are only interested in the values of these columns:
Via pd.melt - one possible way is to run the melt via a groupby on the columns:
out = df.set_index('unit')
grouped = out.columns.str.split(r'_\d_').str.join('')
# group on the split
grouped = out.groupby(grouped, axis = 1)
# iterate, melt individually, and recombine into a new dataframe
out = {key : frame.melt(ignore_index = False).value
       for key, frame in grouped}
pd.DataFrame(out).reset_index()
unit xmean ymean
0 50 10 30
1 50 20 40
Via pd.stack - Here we split the columns and build a MultiIndex:
out = df.set_index('unit')
split = out.columns.str.split(r'_(\d)_')
split = [(f"{first}{last}", middle)
         for first, middle, last
         in split]
out.columns = pd.MultiIndex.from_tuples(split)
out.stack(-1).droplevel(-1).reset_index()
unit xmean ymean
0 50 10 30
1 50 20 40
Via pd.wide_to_long - Here we reorder the sub-labels, moving the numbers to the end of the column names:
out = df.set_index('unit')
out.columns = [f"{first}{last}_{middle}"
               for first, middle, last
               in out.columns.str.split(r'_(\d)_')]
(pd
 .wide_to_long(
     out.reset_index(),
     stubnames = ['xmean', 'ymean'],
     i = 'unit',
     j = 'num',
     sep = '_')
 .droplevel(-1)
 .reset_index()
)
unit xmean ymean
0 50 10 30
1 50 20 40
Via pivot_longer - Again, with pivot_longer, it is all about the patterns. Simply pass a list of new column names to names_to, and the corresponding regular expressions to names_pattern:
df.pivot_longer(index = 'unit',
                names_to = ['xmean', 'ymean'],
                names_pattern = ['x', 'y'])
unit xmean ymean
0 50 10 30
1 50 20 40
Note that with this pattern, matching is on a first-come, first-served basis - if the column order were flipped, pivot_longer would give a different output. Let's see this in action:
# reorder the columns in a different form:
df = df.loc[:, ['x_1_mean', 'x_2_mean', 'y_2_mean', 'y_1_mean', 'unit']]
df
x_1_mean x_2_mean y_2_mean y_1_mean unit
0 10 20 40 30 50
Because the order has changed, x_1_mean will be paired with y_2_mean, because that is the first y column it sees, while x_2_mean gets paired with y_1_mean:
df.pivot_longer(index = 'unit',
                names_to = ['xmean', 'ymean'],
                names_pattern = ['x', 'y'])
unit xmean ymean
0 50 10 40
1 50 20 30
Note the difference in the output compared to the previous run. This is something to note when using names_pattern with a sequence. Order matters.
Scenario 5 - Retain part of the column names as headers:
This is probably one of the biggest use cases when reshaping to long form. We may wish to keep some parts of the column label as headers, and move the remaining sub-labels into new columns (or even ignore them).
Let's revisit our iris dataframe:
df = pd.DataFrame(
    {'Sepal.Length': [5.1, 5.9],
     'Sepal.Width': [3.5, 3.0],
     'Petal.Length': [1.4, 5.1],
     'Petal.Width': [0.2, 1.8],
     'Species': ['setosa', 'virginica']}
)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Our goal here is to keep Sepal and Petal as column names, while the rest (Length, Width) are collated into a dimension column:
Via pd.melt - A pivot is used after melting into long form:
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
 .assign(part = arr.str[0],
         dimension = arr.str[1])
 .pivot(index = ['Species', 'dimension'], columns = 'part', values = 'value')
 .rename_axis(columns = None)
 .reset_index()
)
Species dimension Petal Sepal
0 setosa Length 1.4 5.1
1 setosa Width 0.2 3.5
2 virginica Length 5.1 5.9
3 virginica Width 1.8 3.0
This is not as efficient as the other options below, as it involves reshaping wide to long, then long to wide; this might perform poorly on large enough dataframes.
Via pd.stack - This offers more efficiency as most of the reshaping is on the columns - less is more.
out = df.set_index('Species')
out.columns = out.columns.str.split('.', expand = True)
out.columns.names = [None, 'dimension']
out.stack('dimension').reset_index()
Species dimension Petal Sepal
0 setosa Length 1.4 5.1
1 setosa Width 0.2 3.5
2 virginica Length 5.1 5.9
3 virginica Width 1.8 3.0
Via pd.wide_to_long - Straightforward - simply pass in the relevant arguments:
(pd
 .wide_to_long(
     df,
     stubnames=['Sepal', 'Petal'],
     i = 'Species',
     j = 'dimension',
     sep='.',
     suffix='.+')
 .reset_index()
)
Species dimension Sepal Petal
0 setosa Length 5.1 1.4
1 virginica Length 5.9 5.1
2 setosa Width 3.5 0.2
3 virginica Width 3.0 1.8
As the data size increases, pd.wide_to_long might not be as efficient.
Via pivot_longer: Again, back to patterns. Since we are keeping part of the column label as a header, we use .value as a placeholder. The function sees the .value and knows that sub-label has to remain as a header. The split in the columns can be either by names_sep or names_pattern. In this case, it is simpler to use names_sep:
df.pivot_longer(index = 'Species',
                names_to = ('.value', 'dimension'),
                names_sep = '.')
Species dimension Sepal Petal
0 setosa Length 5.1 1.4
1 virginica Length 5.9 5.1
2 setosa Width 3.5 0.2
3 virginica Width 3.0 1.8
When the column label is split on the dot, we have Petal, Length. When compared with ('.value', 'dimension'), Petal is associated with .value, while Length is associated with dimension. Petal stays as a column header, while Length is lumped into the dimension column. We didn't need to be explicit about the column name; we just use .value and let the function do the heavy work. This way, if you have lots of columns, you don't need to work out which columns should stay as headers, as long as you have the right pattern via names_sep or names_pattern.
What if we want Length/Width as column names instead, while Petal/Sepal get lumped into a part column?
Via pd.melt
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
 .assign(part = arr.str[0],
         dimension = arr.str[1])
 .pivot(index = ['Species', 'part'], columns = 'dimension', values = 'value')
 .rename_axis(columns = None)
 .reset_index()
)
Species part Length Width
0 setosa Petal 1.4 0.2
1 setosa Sepal 5.1 3.5
2 virginica Petal 5.1 1.8
3 virginica Sepal 5.9 3.0
Via pd.stack:
out = df.set_index('Species')
out.columns = out.columns.str.split('.', expand = True)
out.columns.names = ['part', None]
out.stack('part').reset_index()
Species part Length Width
0 setosa Petal 1.4 0.2
1 setosa Sepal 5.1 3.5
2 virginica Petal 5.1 1.8
3 virginica Sepal 5.9 3.0
Via pd.wide_to_long - First, we need to reorder the column sub-labels, such that Length/Width are at the front:
out = df.set_index('Species')
out.columns = out.columns.str.split('.').str[::-1].str.join('.')
(pd
 .wide_to_long(
     out.reset_index(),
     stubnames=['Length', 'Width'],
     i = 'Species',
     j = 'part',
     sep='.',
     suffix='.+')
 .reset_index()
)
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
Via pivot_longer:
df.pivot_longer(index = 'Species',
                names_to = ('part', '.value'),
                names_sep = '.')
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
Notice that we did not have to do any column reordering (there are scenarios where it is unavoidable); the function simply paired .value with whatever the split from names_sep gave, and output the reshaped dataframe. You can even use multiple .value placeholders where applicable. Let's revisit an earlier dataframe:
df = pd.DataFrame({'x_1_mean': [10],
                   'x_2_mean': [20],
                   'y_1_mean': [30],
                   'y_2_mean': [40],
                   'unit': [50]})
df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
df.pivot_longer(index = 'unit',
                names_to = ('.value', '.value'),
                names_pattern = r"(.).+(mean)")
unit xmean ymean
0 50 10 30
1 50 20 40
It is all about seeing the patterns and taking advantage of them. pivot_longer just offers efficient and performant abstractions over common reshaping scenarios - under the hood it is just Pandas, NumPy, and Python.
Hopefully, the various answers point you in the right direction when you need to reshape from wide to long.