4

I have a question regarding splitting a list in a dataframe column into multiple columns. But every value that is splitted needs to be placed in a specific column.

Let's say I have this Dataframe:

date                   data
2020-01-01 00:00:00    [G07, G08, G10, G16]
2020-01-01 00:00:01    [G07, G08, G16]
2020-01-01 00:00:02    [G08, G10, G16, G20, G21]
2020-01-01 00:00:03    [G16, G20, G21, G26, G27, R02]
2020-01-01 00:00:04    [G07, G08, G26, G27]

And I'm looking for this kind of result:

date                   G07  G08  G10  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:00    G07  G08  G10  G16  NaN  NaN  NaN  NaN  NaN
2020-01-01 00:00:01    G07  G08  NaN  G16  NaN  NaN  NaN  NaN  NaN
2020-01-01 00:00:02    NaN  G08  G10  G16  G20  G21  NaN  NaN  NaN
2020-01-01 00:00:03    NaN  NaN  NaN  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:04    G07  G08  NaN  NaN  NaN  NaN  G26  G27  NaN

To finally get this kind of matrix:

date                   G07  G08  G10  G16  G20  G21  G26  G27  R02
2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0    
2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0    
2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1    
2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0    

By doing this type of command :

In [1] pd.DataFrame(self.df['data'].to_list())

Out [1] date                   1    2    3    4    5    6    
        2020-01-01 00:00:00    G07  G08  G10  G16
        2020-01-01 00:00:01    G07  G08  G16
        2020-01-01 00:00:02    G08  G10  G16  G20  G21
        2020-01-01 00:00:03    G16  G20  G21  G26  G27  R02
        2020-01-01 00:00:04    G07  G08  G26  G27

I'm only allowed to split the list into other columns. But I cannot find a way to place each value into a specific column.

I've been thinking of making loops over each values of each dates but it is very slow and I have datasets that are more than 1,000,000 rows.

  • I recommend [this answer](https://stackoverflow.com/a/51420716/9840637) for larger dataframes. – anky Jun 02 '21 at 15:20

5 Answers5

4

Check with MultiLabelBinarizer from sklearn

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

s = pd.DataFrame(mlb.fit_transform(df['data']),columns=mlb.classes_, index=df.index)

df = df.join(s)
BENY
  • 317,841
  • 20
  • 164
  • 234
3

Another method:

x = (
    pd.DataFrame([{k: 1 for k in v} for v in df["data"]])
    .replace(np.nan, 0)
    .astype(int)
)
print(pd.concat([df["date"], x], axis=1))

Prints:

                  date  G07  G08  G10  G16  G20  G21  G26  G27  R02
0  2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
1  2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
2  2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
3  2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
4  2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
1

try via join(),strip(), get_dummies() and drop() method:

out=df.join(df['data'].astype(str).str.strip('[]').str.get_dummies(',')).drop('data',1)

Output of out:

enter image description here

Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
1

Adding one more method:

k = df.explode('data').assign(temp = 1)
df = k.pivot(*k).fillna(0)

Further transformation(if required):

df = df.rename_axis(columns=None).reset_index().convert_dtypes()

OUTPUT:

                  date  G07  G08  G10  G16  G20  G21  G26  G27  R02
0  2020-01-01 00:00:00    1    1    1    1    0    0    0    0    0
1  2020-01-01 00:00:01    1    1    0    1    0    0    0    0    0
2  2020-01-01 00:00:02    0    1    1    1    1    1    0    0    0
3  2020-01-01 00:00:03    0    0    0    1    1    1    1    1    1
4  2020-01-01 00:00:04    1    1    0    0    0    0    1    1    0
Nk03
  • 14,699
  • 2
  • 8
  • 22
1

Doing using pandas

(dataf
    .explode("data")
    .pivot(index="date", columns="data", values="data")
    .notna()
    .astype(int))

We get the required output format:

data                  G08   G1   G10   G16   G2  ...   G27   R0  G07  G08  G16
date                                             ...                          
2020-01-01 00:00:00     1    1     1     0    0  ...     0    0    1    0    0
2020-01-01 00:00:01     1    1     0     0    0  ...     0    0    1    0    0
2020-01-01 00:00:02     0    0     1     1    1  ...     0    0    0    1    0
2020-01-01 00:00:03     0    0     0     0    0  ...     1    1    0    0    1
2020-01-01 00:00:04     1    0     0     0    1  ...     0    0    1    0    0

We explode the data column, pivot the table with date as index and data as column values. Then, use get True or False from checking na and cast to int ;)

Data and code

import io
import pandas as pd


data = io.StringIO("""
date|data
2020-01-01 00:00:00|[G07, G08, G10, G16]
2020-01-01 00:00:01|[G07, G08, G16]
2020-01-01 00:00:02|[G08, G10, G16, G20, G21]
2020-01-01 00:00:03|[G16, G20, G21, G26, G27, R02]
2020-01-01 00:00:04|[G07, G08, G26, G27]
""")

dataf = pd.read_csv(data, sep="|", parse_dates=["date"], converters={"data":lambda x: x[1:-2].split(",")})
Prayson W. Daniel
  • 14,191
  • 4
  • 51
  • 57