4

I have a Pandas Dataframe that looks something like this:

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2", 
          "label_1, label_3, label_2", 
          "label_2, label_4", 
          "label_1, label_2, label_5", 
          "label_2, label_3", 
          "label_3, label_5, label_1, label_2", 
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df



   text                              labels
0  abcd                    label_1, label_2
1  efgh           label_1, label_3, label_2
2  ijkl                    label_2, label_4
3  mnop           label_1, label_2, label_5
4  qrst                    label_2, label_3
5  uvwx  label_3, label_5, label_1, label_2
6    yz                    label_1, label_3

I would like to format the dataframe into something like this:

text  label_1  label_2  label_3  label_4  label_5

abcd        1.0      1.0      0.0      0.0      0.0
efgh        1.0      1.0      1.0      0.0      0.0
ijkl        0.0      1.0      0.0      1.0      0.0
mnop        1.0      1.0      0.0      0.0      1.0
qrst        0.0      1.0      1.0      0.0      0.0
uvwx        1.0      1.0      1.0      0.0      1.0
yz          1.0      0.0      1.0      0.0      0.0

How can I accomplish this? (I know I can split the strings in the labels and convert them into lists by doing something like df.labels.str.split(",") but not sure as to how to proceed from there.

(so basically I'd like to convert those keywords in the labels columns into its own columns and fill in 1 whenever they appear as shown in expected output)

jpp
  • 159,742
  • 34
  • 281
  • 339
ultron
  • 442
  • 8
  • 16

4 Answers4

5

You can use pd.Series.str.get_dummies and combine with the text series:

dummies = df['labels'].str.replace(' ', '').str.get_dummies(',')
res = df['text'].to_frame().join(dummies)

print(res)

   text  label_1  label_2  label_3  label_4  label_5
0  abcd        1        1        0        0        0
1  efgh        1        1        1        0        0
2  ijkl        0        1        0        1        0
3  mnop        1        1        0        0        1
4  qrst        0        1        1        0        0
5  uvwx        1        1        1        0        1
6    yz        1        0        1        0        0
jpp
  • 159,742
  • 34
  • 281
  • 339
2

A simle solution would be to use pd.get_dummies as follows:

pd.get_dummies(
    df.set_index('text')['labels'].str.split(', ', expand=True).stack()
).groupby('text').sum()
Shaido
  • 27,497
  • 23
  • 70
  • 73
2

code:

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df = df.drop('labels', axis=1).join(
             df.labels
             .str
             .split(', ', expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('labels')
             )

df['value'] = 1
df_new = df.pivot(values = 'value', index='text', columns = 'labels').fillna(0)
print(df_new)

output:

labels  text  label_1  label_2  label_3  label_4  label_5
0       abcd      1.0      1.0      0.0      0.0      0.0
1       efgh      1.0      1.0      1.0      0.0      0.0
2       ijkl      0.0      1.0      0.0      1.0      0.0
3       mnop      1.0      1.0      0.0      0.0      1.0
4       qrst      0.0      1.0      1.0      0.0      0.0
5       uvwx      1.0      1.0      1.0      0.0      1.0
6         yz      1.0      0.0      1.0      0.0      0.0

in this main thing is split use (,) with space, because of you string format, if you change that format than use appropriate split.

for example:

if you are using split with single comma like this

df = df.drop('labels', axis=1).join(
                 df.labels
                 .str
                 .split(',', expand=True)
                 .stack()
                 .reset_index(drop=True, level=1)
                 .rename('labels')
                 )

then you will need additional code for removing spaces

df['labels'] = df['labels'].str.replace(" ", "")

rest of the code will be same.

Nihal
  • 5,262
  • 7
  • 23
  • 41
1

If number of columns are dynamic, this will help find the possible ones.

unique = df['labels'].apply(lambda x: x.split(", ")).values.tolist()
unique = [i for sublist in unique for i in sublist]
unique = set(unique)

Hence, unique is now.
{'label_1', 'label_2', 'label_3', 'label_4', 'label_5'}

max_label = len(unique)

Which will give us the maximum number of columns.

Answer

def labeller(labels):
    value = [0] * max_label
    for label in labels:
        value[int(label[-1])-1] = 1
    return value

df['labels'] = df['labels'].apply(lambda x: x.split(", ")).apply(labeller)

df[['label_' + str(i+1) for i in range(max_label)]] = df.labels.apply(pd.Series)
df.drop(['labels'], axis=1, inplace=True)

    text    label_1 label_2 label_3 label_4 label_5
0   abcd    1       1       0       0       0
1   efgh    1       1       1       0       0
2   ijkl    0       1       0       1       0
3   mnop    1       1       0       0       1
4   qrst    0       1       1       0       0
5   uvwx    1       1       1       0       1
6   yz      1       0       1       0       0
Nihal
  • 5,262
  • 7
  • 23
  • 41
Raunaq Jain
  • 917
  • 7
  • 13