
A simple DataFrame whose columns contain sets:

df = pd.DataFrame({'a': [{0,1}, {1,2}, {}], 'b': [{1,2},{2,3,4}, {3}]})
df
        a          b
0  {0, 1}     {1, 2}
1  {1, 2}  {2, 3, 4}
2      {}        {3}

I want to convert several specific columns of sets into columns of lists. I'm using apply, and it doesn't work:

df[['a','b']].apply(lambda x: list(x))
        a          b
0  {0, 1}     {1, 2}
1  {1, 2}  {2, 3, 4}
2      {}        {3}

It works for a single column / Series though:

df['a'].apply(lambda x: list(x))
0    [0, 1]
1    [1, 2]
2        []
Name: a, dtype: object

A different function, on a different DataFrame that does not involve sets, works on multiple columns as expected:

df2 = pd.DataFrame({'a':[0,1,2], 'b':[3,4,5]})
df2[['a','b']].apply(lambda x: x + 1)
   a  b
0  1  4
1  2  5
2  3  6

So is there a one-liner for what I want to do, without looping over the columns?

Giora Simchoni

2 Answers


I think you are looking for applymap, which applies a function to every element rather than to each column. Also, lambda x: list(x) can be simplified to just list:

In [5]: df[['a', 'b']].applymap(list)
Out[5]:
        a          b
0  [0, 1]     [1, 2]
1  [1, 2]  [2, 3, 4]
2      []        [3]
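
For context on why apply returned the frame unchanged: DataFrame.apply hands each whole column to the function as a Series, so list(x) merely lists the sets in that column, and pandas assembles them back into the same column. An element-wise method calls the function on every cell instead. The sketch below illustrates the contrast. The hasattr check is my own assumption for portability, since pandas 2.1 and later rename applymap to DataFrame.map and deprecate the old name. It also uses set() for the empty cell, because the {} literal in the question is actually an empty dict.

```python
import pandas as pd

# set() is used for the empty cell; a bare {} would create an empty dict.
df = pd.DataFrame({'a': [{0, 1}, {1, 2}, set()],
                   'b': [{1, 2}, {2, 3, 4}, {3}]})

# Column-wise: the lambda receives a whole Series, so the cells stay sets.
by_column = df[['a', 'b']].apply(lambda col: list(col))

# Element-wise: the function is called once per cell.
# Pick DataFrame.map when it exists (pandas 2.1+), else fall back to applymap.
elementwise = pd.DataFrame.map if hasattr(pd.DataFrame, 'map') else pd.DataFrame.applymap
by_cell = elementwise(df[['a', 'b']], list)

print(type(by_column.loc[0, 'a']))  # still a set
print(type(by_cell.loc[0, 'a']))    # now a list
```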
iz_
    Going to mark this answer as correct, although @coldspeed has demonstrated his solution to be faster, because (a) the elegance and (b) for the average future user this would probably suffice. – Giora Simchoni Jan 24 '19 at 06:16
    @GioraSimchoni Accepts are not important to me. I am quite satisfied as long as you were able to take away something useful. Thanks! – cs95 Jan 24 '19 at 06:17

Try using a nested list comprehension for performance:

pd.DataFrame([[list(l) for l in r] for r in df.values], 
             index=df.index,
             columns=df.columns)

        a          b
0  [0, 1]     [1, 2]
1  [1, 2]  [2, 3, 4]
2      []        [3]
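
If only some columns need converting, the same pure-Python idea can be applied per column. This is a sketch rather than part of the original answer: the extra column c and the name set_cols are hypothetical, added only to show that untouched columns survive.

```python
import pandas as pd

df = pd.DataFrame({'a': [{0, 1}, {1, 2}, set()],
                   'b': [{1, 2}, {2, 3, 4}, {3}],
                   'c': [10, 20, 30]})  # hypothetical column to leave as-is

set_cols = ['a', 'b']  # hypothetical name for the columns to convert

# Build each converted column with a plain list comprehension,
# then attach them all in a single assign call (returns a new frame).
converted = df.assign(**{col: [list(cell) for cell in df[col]] for col in set_cols})

print(converted)
```

Because assign returns a new DataFrame, the original df keeps its sets; assign the result back to df if an in-place feel is preferred.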

When it comes to mixed dtypes, I fully believe in the power of pure Python. For more on when loops beat pandas, see my writeup: "For loops with pandas - When should I care?"

The difference shows even for tiny frames (roughly 5x in the timings below):

%timeit df[['a', 'b']].applymap(list)
%%timeit
pd.DataFrame([[list(l) for l in r] for r in df.values], 
             index=df.index,
             columns=df.columns)

3.41 ms ± 92 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
669 µs ± 63.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
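
The %timeit magics only work inside IPython. Outside it, a comparable measurement can be sketched with the standard-library timeit module. The call count of 200 and the helper function names are my own choices, and absolute numbers will differ by machine and pandas version, so no expected output is shown.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': [{0, 1}, {1, 2}, set()],
                   'b': [{1, 2}, {2, 3, 4}, {3}]})

# pandas 2.1+ renames applymap to DataFrame.map; use whichever exists.
elementwise = pd.DataFrame.map if hasattr(pd.DataFrame, 'map') else pd.DataFrame.applymap


def with_elementwise():
    # Element-wise conversion through pandas.
    return elementwise(df[['a', 'b']], list)


def with_comprehension():
    # Nested list comprehension over the raw values, as in the answer above.
    return pd.DataFrame([[list(cell) for cell in row] for row in df.values],
                        index=df.index,
                        columns=df.columns)


n_calls = 200  # arbitrary; raise it for steadier numbers
t_elementwise = timeit.timeit(with_elementwise, number=n_calls) / n_calls
t_comprehension = timeit.timeit(with_comprehension, number=n_calls) / n_calls

print(f"element-wise:  {t_elementwise * 1e6:.0f} us per call")
print(f"comprehension: {t_comprehension * 1e6:.0f} us per call")
```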
cs95