
I have a DataFrame with a column whose values are either np.nan or a variable-length list of strings.

Simply put, what I want is exactly the same as the accepted answer here (from @Emre): https://datascience.stackexchange.com/questions/11797/split-a-list-of-values-into-columns-of-a-dataframe

The problem I have is the np.nan values in my column, which are absent in the accepted answer above.

When I run the code I get this error:

Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 76, in <module>
    for i in frozenset.union(*fcc['JobRoleInterest']):
TypeError: descriptor 'union' for 'frozenset' objects doesn't apply to a 'float' object

So I changed all of the np.nan values to None, but now I get this:

Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 76, in <module>
    for i in frozenset.union(*fcc['JobRoleInterest']):
TypeError: descriptor 'union' for 'frozenset' objects doesn't apply to a 'NoneType' object

Here is the code section I am working on:

# https://stackoverflow.com/questions/14162723/replacing-pandas-or-numpy-nan-with-a-none-to-use-with-mysqldb/54403705
# fcc = fcc.where(pd.notnull(fcc), None)  # Entire df of np.nan replaced with None
fcc['JobRoleInterest'] = fcc['JobRoleInterest'].where(pd.notnull(fcc['JobRoleInterest']), None)
# fcc['JobRoleInterest'] = None if fcc['JobRoleInterest'] == np.nan else fcc['JobRoleInterest']
for i in frozenset.union(*fcc['JobRoleInterest']):
    fcc[i] = fcc.apply(lambda _: int(i in _.i), axis=1)
MarkS

1 Answer


If you are on pandas 0.25+, you can use explode:

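import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Text': [['a', 'b', 'c'], ['b', 'c', 'd'], np.nan]
})

# explode() yields one row per list element; a NaN entry stays a single NaN row
new_df = (df.Text.explode()
   .groupby(level=0).value_counts()   # count values per original row (NaN is dropped)
   .unstack(fill_value=0)             # turn the values into columns
   .reindex(df.index, fill_value=0)   # restore rows that held only NaN
)

# append the new columns to the original frame
ret = df.join(new_df)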

Output (new_df):

Text  a  b  c  d
0     1  1  1  0
1     0  1  1  1
2     0  0  0  0
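To keep the existing columns rather than replacing the whole frame, the same chain can be assigned to a separate variable and joined back. The following is only a minimal sketch: the `fcc` frame here is a made-up stand-in (the `Age` column and all values are assumptions, not the asker's data), and `indicators` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `fcc` frame; the data is invented.
fcc = pd.DataFrame({
    "Age": [25, 31, 40],
    "JobRoleInterest": [["Web", "Data"], np.nan, ["Data"]],
})

# Build the indicator columns in a separate frame instead of overwriting fcc.
indicators = (
    fcc["JobRoleInterest"].explode()
    .groupby(level=0).value_counts()
    .unstack(fill_value=0)
    .reindex(fcc.index, fill_value=0)
)

# join() aligns on the index, so the original columns are preserved.
fcc = fcc.join(indicators)
print(fcc)
```

Because `join` aligns on the index, this assumes the index of `fcc` is unique; with duplicate index labels, `groupby(level=0)` would merge those rows.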
Quang Hoang
  • @QuangHoang, your answer 99% works. When I do this: `fcc = fcc['JobRoleInterest'].explode().groupby(level=0).value_counts().unstack(fill_value=0).reindex(fcc.index, fill_value=0)` then the entire fcc dataframe becomes your answer. How would I append all of the newly created columns to the existing dataframe? That is how the answer I referenced above is working. – MarkS May 24 '20 at 13:21
  • @QuangHoang *much* obliged. – MarkS May 24 '20 at 13:26
  • Not sure but `pd.get_dummies(df['Text'].explode()).groupby(level=0).max()` might also be an option – Jon Clements May 24 '20 at 13:32
  • or `df.Text.dropna().apply(pd.value_counts).reindex(df.index).fillna(0, downcast='infer')` – Mark Wang May 24 '20 at 13:48
  • @Jon Clements, your answer also works, in the same way as above: a new 208-column df. – MarkS May 24 '20 at 13:48
  • @Mark Wang, yours as well. – MarkS May 24 '20 at 13:50
  • @Jon Clements your answer can be further simplified to `pd.get_dummies(df['Text'].explode()).max(level=0)`, given that `max` has a built-in groupby feature – Mark Wang May 24 '20 at 13:51
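The `get_dummies` variant from the comments can be sketched as below. As far as I know, the `level=` argument to `max` was deprecated in pandas 1.3 and removed in 2.0, so this sketch uses the explicit `groupby(level=0).max()` form. The `.astype(int)` cast is an addition of mine, assuming 0/1 integers are wanted (recent pandas versions return booleans from `get_dummies`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Text": [["a", "b", "c"], ["b", "c", "d"], np.nan]})

# get_dummies leaves the NaN row as all zeros; max() per original row
# collapses the exploded rows back to one indicator row each.
dummies = (
    pd.get_dummies(df["Text"].explode())
    .groupby(level=0).max()
    .astype(int)  # assumption: 0/1 integers rather than booleans
)

ret = df.join(dummies)
print(ret)
```

Note that this variant marks presence only, whereas the `value_counts` approach counts occurrences; the two agree when no list contains a repeated value.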