Flatten Unflatten the Data Using Pandas

Question

How to change the columns containing Small_X, Large_X (Where X is numbers 1,2,3,... etc) need to be un-flattened with all other values propagated to the new records and a new column called "New_Column" will be added that denotes the X value.

import pandas as pd
data = {'Time':  ['12/1/19 0:00', '12/1/19 0:05'],
    'Small_1': [1, 0],
    'Large_1': [0, 0],
    'Con_1': [0, 0],
    'Small_2': [0, 0],
    'Large_2': [0, 0],
    'Con_2': [0, 0],
    'Small_10': [1, 0],
    'large_10': [0, 0],
    'Con_10': [0, 0],
    'Some_other_value': [78, 96],
    }
df = pd.DataFrame(data)

Above input data should result in the following Format using Pandas.

Please include a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), instead of images. — sushanth, Jul 04 '20 at 05:39
(1) As mentioned above, a code that creates few rows of the DataFrame shown in the top screenshot would help. (2) Also, it is unclear how to handle the `1` from say `Small_10`. Would that replace the `1` from `Small_1` etc. Assuming each timestamp has different values in each set of columns for `Small`, `Large`, and `Con`. — Nilesh Ingle, Jul 04 '20 at 07:16
The above screenshot is already created CSV in which all the attributes are there numbered as small_1, large_1, con_1 .... small_10 and so on. Timestamp should be grouped later so that for each small_X, large_X it converted to small and Large resp. @NileshIngle — Mr.Robot, Jul 04 '20 at 07:36
@Mr.Robot: (1) Sorry for not being clear, I meant a code that we can copy/paste to create the `DataFrame` in question would be helpful. So that we do not have to manully type the numbers from the `Screenshot 1` to create the dataframe on our side. (2) Would the timestamp say `12-01-2019 00:00` be repeated `10 times` when the columns `Small_1` to `Small_10` are converted to the format shown in `Screenshot 2`? Assuming then, that the `Screenshot 2` posted above is a after the data is `grouped`. — Nilesh Ingle, Jul 04 '20 at 07:46
(1) Okay I'm adding the sampe DataFrame in the Question so that it would be more convenient. (2) Yes it would repeat for every X but not for small large or con. @NileshIngle — Mr.Robot, Jul 04 '20 at 08:33
@NileshIngle I updated the question. I hope this would be more convenient — Mr.Robot, Jul 04 '20 at 08:43
@Mr.Robot: Thanks. I've posted the answer that outputs aggregate value of `np.max` — Nilesh Ingle, Jul 04 '20 at 11:48

Mathia Haure-Touzé · Answer 1 · 2020-07-06T07:19:02.653

1

I am proposing first to filter columns of interest, then to melt the columns, then split columns name on "_" then pivot the dataframe:

# Filter columns
df = df.filter(regex=r"^([Cc]on|[Ll]arge|[Ss]mall|Time).*") 

# Melt dataframe
new = df.melt(id_vars="Time")

# Split column name
new[["variable", "New_Column"]] = new.variable.str.split("_", expand=True)

# Set variable as title
new["variable"] = new.variable.str.title()

# Pivot dataframe
new = pd.pivot_table(new, index=["Time", "New_Column"], values="value", columns="variable")

print(new.reset_index())

edited Jul 06 '20 at 07:19

answered Jul 04 '20 at 08:46

Mathia Haure-Touzé

311
3
6

@Mr.Robot, I updated my answer to select columns of interest – Mathia Haure-Touzé Jul 06 '20 at 07:09

score 1 · Accepted Answer · answered Jul 04 '20 at 11:47

Alternative code:

Code below uses .stack(), .unstack() and .pivot_table(). It selectively only picks column names that have the strings small, large,con specified in list key to be split.

Note:

The output is aggregated by aggfunc=np.max in the .pivot_table(). This could be changed to any other type of aggregation method as needed such as np.min, np.sum, or a custom function.
The original column names were changed to lower case, except Time

Code:

# Import libraries
import pandas as pd

# Create DataFrame (copy pasted from question above)
data = {'Time':  ['12/1/19 0:00', '12/1/19 0:05'],
    'Small_1': [1, 0],
    'Large_1': [0, 0],
    'Con_1': [0, 0],
    'Small_2': [0, 0],
    'Large_2': [0, 0],
    'Con_2': [0, 0],
    'Small_10': [1, 0],
    'large_10': [0, 0],
    'Con_10': [0, 0],
    'Some_other_value': [78, 96],
    }
df = pd.DataFrame(data)

# Set Time as index
df = df.set_index('Time')

# Rename columns
df.columns = df.columns.str.lower() # change case to lower

# Stack
df = df.stack().reset_index() # convert columns to rows

# Split based on condition
key = ['small', 'large','con'] # Column names to be split
df['col1'] = df['level_1'].apply(lambda x: x.split('_')[0] if x.split('_')[0] in key else x)
df['New_Column'] = df['level_1'].apply(lambda x: x.split('_')[1] if x.split('_')[0] in key else np.NaN)

# Drop/rename columns
df = df.drop(['level_1'], axis=1)
df.columns.name=''

# Pivot using aggregate function: np.max
df = df.pivot_table(index=['Time', 'New_Column'], columns='col1', values=0, aggfunc=np.max)

# Rearrange
df = df.reset_index()
df.columns.name=''
df = df[['Time','small', 'large', 'con', 'New_Column']]

Output

print(df)

           Time  small  large  con New_Column
0  12/1/19 0:00      1      0    0          1
1  12/1/19 0:00      1      0    0         10
2  12/1/19 0:00      0      0    0          2
3  12/1/19 0:05      0      0    0          1
4  12/1/19 0:05      0      0    0         10
5  12/1/19 0:05      0      0    0          2

That's pretty much according to my requirement. Need to do few changes. Else it was fine. thanks a lot @Nilesh — Mr.Robot, Jul 06 '20 at 05:46

Flatten Unflatten the Data Using Pandas

2 Answers2