Create feature columns from a single column of lists

Question

I have a df with a column whose values are either np.nan or a variable-length list of strings.

Simply put, what I want is exactly the same as the accepted answer here (from @Emre): https://datascience.stackexchange.com/questions/11797/split-a-list-of-values-into-columns-of-a-dataframe

The problem I have is the np.nan values in my column, which are absent in the accepted answer above.

When I run the code I get this error:

Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 76, in <module>
    for i in frozenset.union(*fcc['JobRoleInterest']):
TypeError: descriptor 'union' for 'frozenset' objects doesn't apply to a 'float' object

So I changed all of the np.nan values to None, but now I get this:

Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 76, in <module>
    for i in frozenset.union(*fcc['JobRoleInterest']):
TypeError: descriptor 'union' for 'frozenset' objects doesn't apply to a 'NoneType' object

Here is the code section I am working on:

# https://stackoverflow.com/questions/14162723/replacing-pandas-or-numpy-nan-with-a-none-to-use-with-mysqldb/54403705
# fcc = fcc.where(pd.notnull(fcc), None)  # Entire df of np.nan replaced with None
fcc['JobRoleInterest'] = fcc['JobRoleInterest'].where(pd.notnull(fcc['JobRoleInterest']), None)
# fcc['JobRoleInterest'] = None if fcc['JobRoleInterest'] == np.nan else fcc['JobRoleInterest']
for i in frozenset.union(*fcc['JobRoleInterest']):
    fcc[i] = fcc.apply(lambda _: int(i in _.i), axis=1)

Quang Hoang · Accepted Answer · 2020-05-24T13:22:52.713

If you are on pandas 0.25+, you can use `Series.explode`:

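import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Text': [['a', 'b', 'c'], ['b', 'c', 'd'], np.nan]
})

# One row per list element (a NaN row stays a single NaN), then count
# each value per original row and pivot the values into columns.
new_df = (df.Text.explode()
   .groupby(level=0).value_counts()
   .unstack(fill_value=0)
   .reindex(df.index, fill_value=0)  # restore rows that held only NaN
)

# Append the new columns to the original frame.
ret = df.join(new_df)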

Output (new_df):

Text  a  b  c  d
0     1  1  1  0
1     0  1  1  1
2     0  0  0  0
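
As for the error in the question: the unbound `frozenset.union` needs a frozenset as its first argument, so a leading NaN (a float) or None raises the TypeError. If you would rather keep the loop, a minimal sketch that skips missing values could look like this. The sample data is hypothetical and only stands in for the real `fcc` frame; the column name `JobRoleInterest` is taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's fcc frame.
fcc = pd.DataFrame({
    "JobRoleInterest": [["a", "b"], np.nan, ["b", "c"]],
})

col = fcc["JobRoleInterest"]

# Mark the rows that actually hold a list; NaN and None are skipped.
is_list = col.apply(lambda v: isinstance(v, list))

# Union of every label that appears in any list.
labels = sorted(set().union(*col[is_list]))

for label in labels:
    # 1 if the row's list contains the label, 0 otherwise (NaN rows included).
    fcc[label] = [int(isinstance(v, list) and label in v) for v in col]
```

Starting from an empty `set()` instead of calling `frozenset.union` unbound means the first element no longer has to be a set, and rows without a list simply get 0 in every new column.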

edited May 24 '20 at 13:22

answered May 24 '20 at 13:13


@QuangHoang, your answer 99% works. When I do this: `fcc = fcc['JobRoleInterest'].explode().groupby(level=0).value_counts().unstack(fill_value=0).reindex(fcc.index, fill_value=0)` then the entire fcc dataframe becomes your answer. How would I append all of the newly created columns to the existing dataframe? That is how the answer I referenced above is working. – MarkS May 24 '20 at 13:21
@QuangHoang *much* obliged. – MarkS May 24 '20 at 13:26

Not sure but `pd.get_dummies(df['Text'].explode()).groupby(level=0).max()` might also be an option – Jon Clements May 24 '20 at 13:32

or `df.Text.dropna().apply(pd.value_counts).reindex(df.index).fillna(0, downcast='infer')` – Mark Wang May 24 '20 at 13:48
@Jon Clements, your answer also works, in the same way as above: a new 208-column df. – MarkS May 24 '20 at 13:48
@Mark Wang, yours as well. – MarkS May 24 '20 at 13:50
@Jon Clements, your answer can be further simplified to `pd.get_dummies(df['Text'].explode()).max(level=0)`, since `max` accepts a `level` argument that does the grouping for you – Mark Wang May 24 '20 at 13:51
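
For anyone trying the `get_dummies` variant from the comments on a recent pandas: the `level` argument of `max` was deprecated in pandas 1.3 and removed in 2.0, so the explicit `groupby(level=0)` form is the one that still runs. A minimal sketch, reusing the sample frame from the answer; `dtype=int` is my addition so the result holds 0/1 rather than booleans:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": [["a", "b", "c"], ["b", "c", "d"], np.nan]
})

# One indicator row per list element; the NaN row becomes all zeros
# because get_dummies ignores missing values by default.
dummies = (
    pd.get_dummies(df["Text"].explode(), dtype=int)
    .groupby(level=0)  # collapse back to one row per original index
    .max()
)

# Append the indicator columns to the original frame.
ret = df.join(dummies)
```

Unlike the `value_counts` version, this yields a 0/1 indicator even when a value is repeated within one list, which may or may not be what you want.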
