Split a list from Dataframe column into specific column name

Question

I have a question regarding splitting a list in a dataframe column into multiple columns. But every value that is splitted needs to be placed in a specific column.

Let's say I have this Dataframe:

date                   data
2020-01-01 00:00:00    [G07, G08, G10, G16]
2020-01-01 00:00:01    [G07, G08, G16]
2020-01-01 00:00:02    [G08, G10, G16, G20, G21]
2020-01-01 00:00:03    [G16, G20, G21, G26, G27, R02]
2020-01-01 00:00:04    [G07, G08, G26, G27]

And I'm looking for this kind of result:

date                   G07  G08  G10  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:00    G07  G08  G10  G16  NaN  NaN  NaN  NaN  NaN
2020-01-01 00:00:01    G07  G08  NaN  G16  NaN  NaN  NaN  NaN  NaN
2020-01-01 00:00:02    NaN  G08  G10  G16  G20  G21  NaN  NaN  NaN
2020-01-01 00:00:03    NaN  NaN  NaN  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:04    G07  G08  NaN  NaN  NaN  NaN  G26  G27  NaN

To finally get this kind of matrix:

date                   G07  G08  G10  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0    
2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0    
2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1    
2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0

By doing this type of command :

In [1] pd.DataFrame(self.df['data'].to_list())

Out [1] date                   1    2    3    4    5    6    
        2020-01-01 00:00:00    G07  G08  G10  G16
        2020-01-01 00:00:01    G07  G08  G16
        2020-01-01 00:00:02    G08  G10  G16  G20  G21
        2020-01-01 00:00:03    G16  G20  G21  G26  G27  R02
        2020-01-01 00:00:04    G07  G08  G26  G27

I'm only allowed to split the list into other columns. But I cannot find a way to place each value into a specific column.

I've been thinking of making loops over each values of each dates but it is very slow and I have datasets that are more than 1,000,000 rows.

I recommend [this answer](https://stackoverflow.com/a/51420716/9840637) for larger dataframes. — anky, Jun 02 '21 at 15:20

score 4 · Accepted Answer · answered Jun 02 '21 at 15:14

4

Check with MultiLabelBinarizer from sklearn

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

s = pd.DataFrame(mlb.fit_transform(df['data']),columns=mlb.classes_, index=df.index)

df = df.join(s)

answered Jun 02 '21 at 15:14

BENY

317,841
20
164
234

3

Have you seen this https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies/51420716#51420716 ? – Prayson W. Daniel Jun 02 '21 at 15:40

score 3 · Answer 2 · answered Jun 02 '21 at 15:09

Another method:

x = (
    pd.DataFrame([{k: 1 for k in v} for v in df["data"]])
    .replace(np.nan, 0)
    .astype(int)
)
print(pd.concat([df["date"], x], axis=1))

Prints:

                  date  G07  G08  G10  G16  G20  G21  G26  G27  R02
0  2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
1  2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
2  2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
3  2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
4  2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0

Anurag Dabas · Answer 3 · 2021-06-02T15:13:10.897

1

try via join(),strip(), get_dummies() and drop() method:

out=df.join(df['data'].astype(str).str.strip('[]').str.get_dummies(',')).drop('data',1)

Output of out:

edited Jun 02 '21 at 15:13

answered Jun 02 '21 at 15:07

Anurag Dabas

23,866
9
21
41

1

Thank you very much for your quick response. But it seems that at output the columns G08 and G16 are duplicated. – Germain Roussez Jun 02 '21 at 15:26

Nk03 · Answer 4 · 2021-06-02T15:20:42.107

Adding one more method:

k = df.explode('data').assign(temp = 1)
df = k.pivot(*k).fillna(0)

Further transformation(if required):

df = df.rename_axis(columns=None).reset_index().convert_dtypes()

OUTPUT:

                  date  G07  G08  G10  G16  G20  G21  G26  G27  R02
0  2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
1  2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
2  2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
3  2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
4  2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0

Prayson W. Daniel · Answer 5 · 2021-06-03T06:54:23.240

Doing using pandas

(dataf
    .explode("data")
    .pivot(index="date", columns="data", values="data")
    .notna()
    .astype(int))

We get the required output format:

data                  G08   G1   G10   G16   G2  ...   G27   R0  G07  G08  G16
date                                             ...                          
2020-01-01 00:00:00     1    1     1     0    0  ...     0    0    1    0    0
2020-01-01 00:00:01     1    1     0     0    0  ...     0    0    1    0    0
2020-01-01 00:00:02     0    0     1     1    1  ...     0    0    0    1    0
2020-01-01 00:00:03     0    0     0     0    0  ...     1    1    0    0    1
2020-01-01 00:00:04     1    0     0     0    1  ...     0    0    1    0    0

We explode the data column, pivot the table with date as index and data as column values. Then, use get True or False from checking na and cast to int ;)

Data and code


import io
import pandas as pd


data = io.StringIO("""
date|data
2020-01-01 00:00:00|[G07, G08, G10, G16]
2020-01-01 00:00:01|[G07, G08, G16]
2020-01-01 00:00:02|[G08, G10, G16, G20, G21]
2020-01-01 00:00:03|[G16, G20, G21, G26, G27, R02]
2020-01-01 00:00:04|[G07, G08, G26, G27]
""")

dataf = pd.read_csv(data, sep="|", parse_dates=["date"], converters={"data":lambda x: x[1:-2].split(",")})

Split a list from Dataframe column into specific column name

5 Answers5