Pandas split a column of unequal length lists into multiple boolean columns

Question

Given a DataFrame df1 as follows:

df1 = pd.DataFrame({
    'col1': [1,2,3,4],
    'col2': [['a', 'b'], ['c'], ['a', 'd', 'b'], ['e']]
})

Which looks like:

    col1    col2
0   1       [a, b]
1   2       [c]
2   3       [a, d, b]
3   4       [e]

I want to convert col2 - a column where each cell is a list - into several columns (a, b, c, d, e), where the values are boolean entries defining whether that column name existed in the original list, in the given row.

The output should follow this form:

df2 = pd.DataFrame({
    'col1': [1,2,3,4],
    'a': [True, False, True, False],
    'b': [True, False, True, False],
    'c': [False, True, False, False],
    'd': [False, False, True, False],
    'e': [False, False, False, True]
    
})

Which looks like:

    col1    a       b       c       d       e
0   1       True    True    False   False   False
1   2       False   False   True    False   False
2   3       True    True    False   True    False
3   4       False   False   False   False   True

What's a clean way to do this?

Shubham Sharma · Accepted Answer · 2020-09-26T11:28:23.407

Alternative approach using str.get_dummies probably more efficient than apply + pd.Series:

df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)

       a      b      c      d      e
0   True   True  False  False  False
1  False  False   True  False  False
2   True   True  False   True  False
3  False  False  False  False   True

Timings:

df1.shape
(40000, 2)

%%timeit
df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
286 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)
9.43 s ± 499 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

score 2 · Answer 2 · answered Sep 26 '20 at 12:09

One approach

df2 = df1.reset_index().explode('col2')
new_df = pd.crosstab(df2['col1'], df2['col2']).astype(bool)
print(new_df)

col2      a      b      c      d      e
col1                                   
1      True   True  False  False  False
2     False  False   True  False  False
3      True   True  False   True  False
4     False  False  False  False   True

Or

new_df = (df1.reset_index()
              .explode('col2')
              .groupby(['index', 'col2'])['col1']
              .any()
              .unstack(fill_value=False))

new_df = (df1.reset_index()
             .explode('col2')
             .pivot_table(index='index', 
                          values='col1', 
                          columns='col2', 
                          aggfunc='any', 
                         fill_value=False) 
         )

Nice use of `crosstab` ;) – Shubham Sharma Sep 26 '20 at 12:14 — Shubham Sharma, Sep 26 '20 at 12:14

score 0 · Answer 3 · answered Sep 26 '20 at 11:11

0

Solved.

pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)

answered Sep 26 '20 at 11:11

Ian

3,605
4
31
66

Pandas split a column of unequal length lists into multiple boolean columns

3 Answers3

Linked