1

I have a Series with values as lists of varying elements. Value count shows like this.

category                                                                            count
[Radiometric]                                                                       76
[Ozone]                                                                             59
[Aerosol]                                                                           53
[Cryosphere]                                                                        31
[Atmospheric State, Cloud Properties]                                               29
[Atmospheric State, Radiometric, Surface Properties]                                 8
[POPs]                                                                               8
[Atmospheric State, Cloud Properties, Radiometric]                                   7

I want to create columns for each of this category and mark True/False for each row.

e.g.

index                Aerosol    Cloud Properities     Radiometric  ......
1                     TRUE       FALSE                  TRUE
2                     FALSE       TRUE                   TRUE
3
4

I managed to get the unique list of these categories from all items. I also could make it into separate columns by using method given in solution here.

But in my case data is incomplete/varied and thus give me a DF like below

    1                   2                   3                 4                 5
25  Reactive Gas        Surface Properties  None               None             None
28  Aerosol             Ozone               Atmospheric State Cloud Properties  None
59  Surface Properties  Cryosphere          None               None             None
68  Atmospheric State   Cloud Properties    None               None             None
73  Atmospheric State   Radiometric         None               None             None

is there a way using pandas or other python tools to covert it into the required output. I am now working with pandas.pivot_table with hint from this solution. I am using column 1 as the columns (assuming it has all the categories), but gives me a multi-index DF for each column.

pvt = tmp.pivot_table(index=tmp.index,columns="1",aggfunc="count")

Need help on how can I get to the boolean matrix/df as mentioned above.

arundeep78
  • 159
  • 11

1 Answers1

0

I think you need Series.str.join with Series.str.get_dummies and convert to boolean:

df1 = df.category.str.join('|').str.get_dummies().astype(bool)

Or use MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df.category),columns=mlb.classes_).astype(bool)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Thanks a lot for quick response. I used first method with str.get_dummies and it works like a charm. I have been going through it for hours with various methods. Brilliant!. Accepted answer – arundeep78 Jan 27 '21 at 14:26
  • 1
    No, After I marked it as answer, I noticed some gap in data. I thought i reacted too early. later I figured it was an error in some variable that I added. So, I accepted the answer again. Thanks once again. – arundeep78 Jan 28 '21 at 22:37