exploding a pandas dataframe column

Question

I have a Pandas Dataframe that looks something like this:

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2", 
          "label_1, label_3, label_2", 
          "label_2, label_4", 
          "label_1, label_2, label_5", 
          "label_2, label_3", 
          "label_3, label_5, label_1, label_2", 
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df



   text                              labels
0  abcd                    label_1, label_2
1  efgh           label_1, label_3, label_2
2  ijkl                    label_2, label_4
3  mnop           label_1, label_2, label_5
4  qrst                    label_2, label_3
5  uvwx  label_3, label_5, label_1, label_2
6    yz                    label_1, label_3

I would like to format the dataframe into something like this:

text  label_1  label_2  label_3  label_4  label_5

abcd        1.0      1.0      0.0      0.0      0.0
efgh        1.0      1.0      1.0      0.0      0.0
ijkl        0.0      1.0      0.0      1.0      0.0
mnop        1.0      1.0      0.0      0.0      1.0
qrst        0.0      1.0      1.0      0.0      0.0
uvwx        1.0      1.0      1.0      0.0      1.0
yz          1.0      0.0      1.0      0.0      0.0

How can I accomplish this? (I know I can split the strings in the labels and convert them into lists by doing something like df.labels.str.split(",") but not sure as to how to proceed from there.

(so basically I'd like to convert those keywords in the labels columns into its own columns and fill in 1 whenever they appear as shown in expected output)

Possible duplicate of [pandas: How do I split text in a column into multiple rows?](https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows) — Mohit Motwani, Aug 08 '18 at 09:36
@MohitMotwani I've tried that, it does not produce the required solution — ultron, Aug 08 '18 at 09:41

score 5 · Accepted Answer · answered Aug 08 '18 at 10:40

You can use pd.Series.str.get_dummies and combine with the text series:

dummies = df['labels'].str.replace(' ', '').str.get_dummies(',')
res = df['text'].to_frame().join(dummies)

print(res)

   text  label_1  label_2  label_3  label_4  label_5
0  abcd        1        1        0        0        0
1  efgh        1        1        1        0        0
2  ijkl        0        1        0        1        0
3  mnop        1        1        0        0        1
4  qrst        0        1        1        0        0
5  uvwx        1        1        1        0        1
6    yz        1        0        1        0        0

Shaido · Answer 2 · 2018-08-09T00:56:43.990

2

A simle solution would be to use pd.get_dummies as follows:

pd.get_dummies(
    df.set_index('text')['labels'].str.split(', ', expand=True).stack()
).groupby('text').sum()

edited Aug 09 '18 at 00:56

answered Aug 08 '18 at 09:49

Shaido

27,497
23
70
73

Nihal · Answer 3 · 2018-08-09T05:50:12.490

code:

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df = df.drop('labels', axis=1).join(
             df.labels
             .str
             .split(', ', expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('labels')
             )

df['value'] = 1
df_new = df.pivot(values = 'value', index='text', columns = 'labels').fillna(0)
print(df_new)

output:

labels  text  label_1  label_2  label_3  label_4  label_5
0       abcd      1.0      1.0      0.0      0.0      0.0
1       efgh      1.0      1.0      1.0      0.0      0.0
2       ijkl      0.0      1.0      0.0      1.0      0.0
3       mnop      1.0      1.0      0.0      0.0      1.0
4       qrst      0.0      1.0      1.0      0.0      0.0
5       uvwx      1.0      1.0      1.0      0.0      1.0
6         yz      1.0      0.0      1.0      0.0      0.0

in this main thing is split use (,) with space, because of you string format, if you change that format than use appropriate split.

for example:

if you are using split with single comma like this

df = df.drop('labels', axis=1).join(
                 df.labels
                 .str
                 .split(',', expand=True)
                 .stack()
                 .reset_index(drop=True, level=1)
                 .rename('labels')
                 )

then you will need additional code for removing spaces

df['labels'] = df['labels'].str.replace(" ", "")

rest of the code will be same.

score 1 · Answer 4 · edited Feb 14 '19 at 11:52

If number of columns are dynamic, this will help find the possible ones.

unique = df['labels'].apply(lambda x: x.split(", ")).values.tolist()
unique = [i for sublist in unique for i in sublist]
unique = set(unique)

Hence, unique is now.
{'label_1', 'label_2', 'label_3', 'label_4', 'label_5'}

max_label = len(unique)

Which will give us the maximum number of columns.

Answer

def labeller(labels):
    value = [0] * max_label
    for label in labels:
        value[int(label[-1])-1] = 1
    return value

df['labels'] = df['labels'].apply(lambda x: x.split(", ")).apply(labeller)

df[['label_' + str(i+1) for i in range(max_label)]] = df.labels.apply(pd.Series)
df.drop(['labels'], axis=1, inplace=True)

    text    label_1 label_2 label_3 label_4 label_5
0   abcd    1       1       0       0       0
1   efgh    1       1       1       0       0
2   ijkl    0       1       0       1       0
3   mnop    1       1       0       0       1
4   qrst    0       1       1       0       0
5   uvwx    1       1       1       0       1
6   yz      1       0       1       0       0

exploding a pandas dataframe column

4 Answers4

Answer

Linked