
I have a pandas DataFrame. One of the columns contains lists. I want each row of that column to be a single string.

For example, my list ['one','two','three'] should simply become 'one, two, three'.

df['col'] = df['col'].astype(str).apply(lambda x: ', '.join(df['col'].astype(str)))

gives me ['one', 'two', 'three'], ['four', 'five', 'six'] in every row, where the second list comes from the next row. Needless to say, with millions of rows this concatenation across rows is not only incorrect, it kills my memory.

wjandrea
Rusty Coder

4 Answers


You should certainly not convert to string before you transform the list. Try:

df['col'].apply(', '.join)

Also note that apply applies the function to the elements of the series, so using df['col'] in the lambda function is probably not what you want.


Or, there is a native .str.join method, but it is (surprisingly) a bit slower than apply.
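
For illustration, a minimal sketch of both options, assuming a column named col as in the question:

import pandas as pd

df = pd.DataFrame({'col': [['one', 'two', 'three'], ['four', 'five', 'six']]})

# join each row's list into a single string
df['col'] = df['col'].apply(', '.join)

# equivalent, via the native accessor (run on the original list column):
# df['col'] = df['col'].str.join(', ')

print(df)
#                col
# 0  one, two, three
# 1  four, five, six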

wjandrea
IanS
  • what if items of the list are `int` types? – Khalil Al Hooti Aug 23 '18 at 10:01
  • @KhalilAlHooti this should work: `df['col'].apply(lambda x: ', '.join(map(str, x)))` – IanS Aug 23 '18 at 10:25
  • What if there is some NaN data? `df2['well_array2'].apply(lambda x: ','.join(str(i) for i in x if pd.isnull(all(x))==False))` did not work; I am getting the error `'float' object is not iterable`. – Khalil Al Hooti Aug 23 '18 at 12:16
  • Neither is this working; I am getting the same error: `df2['well_array2'].apply(lambda x: ','.join(str(i) for i in x if type(x) == type([])))` – Khalil Al Hooti Aug 23 '18 at 12:24
  • @IanS the entire x is NaN; there is no list at this row in the well_array2 column. I do not want to filter out NaN values! Thanks in advance. – Khalil Al Hooti Aug 23 '18 at 12:32
  • @KhalilAlHooti this should work: `df['new_col'] = df['col'].dropna().apply(lambda x: ', '.join(map(str, x)))` – IanS Aug 23 '18 at 12:33
  • This filters out null values, but reassigns them when you create the new column (because pandas does index-based assignment, missing values are assigned NaN). – IanS Aug 23 '18 at 12:34
  • I'm not a Pandas expert, but it seems like the reason to prefer `.str.join()` is that it accounts for NaN and other objects automatically. For example use `pd.Series([np.nan, ['one', 'two', 'three']])` and compare the two methods. – wjandrea Jan 25 '22 at 23:06

When you cast col to str with astype, you get the string representation of a Python list, brackets and all. You do not need to do that; just apply join directly:

import pandas as pd

df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['A', 'B', 'C']]
    })

# Out[8]: 
#            A
# 0  [a, b, c]
# 1  [A, B, C]

df['Joined'] = df.A.apply(', '.join)

#            A   Joined
# 0  [a, b, c]  a, b, c
# 1  [A, B, C]  A, B, C
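
For contrast, a quick sketch of the cast this answer warns against, on the same df — each element becomes the list's repr rather than a joined string:

df.A.astype(str)

# 0    ['a', 'b', 'c']
# 1    ['A', 'B', 'C']
# Name: A, dtype: object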
hilberts_drinking_problem

You could convert the lists to str with astype(str) and then remove the ', [ and ] characters. Using @Yakim's example:

In [114]: df
Out[114]:
           A
0  [a, b, c]
1  [A, B, C]

In [115]: df.A.astype(str).str.replace(r"\[|\]|'", '', regex=True)
Out[115]:
0    a, b, c
1    A, B, C
Name: A, dtype: object

Timing

import pandas as pd
df = pd.DataFrame({'A': [['a', 'b', 'c'], ['A', 'B', 'C']]})
df = pd.concat([df]*1000)


In [2]: timeit df['A'].apply(', '.join)
292 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: timeit df['A'].str.join(', ')
368 µs ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: timeit df['A'].apply(lambda x: ', '.join(x))
505 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: timeit df['A'].str.replace(r"\[|\]|'", '', regex=True)
2.43 ms ± 62.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Anton Protopopov
  • Interesting approach. However, I timed your method, and it is slower than `apply` by a factor of 5. – IanS May 20 '16 at 13:43
  • @IanS how did you time it? I thought vectorized string operations as shown in this answer would be much faster than apply – alys Dec 07 '17 at 00:04
  • @Shoof I used the `%timeit` magic in IPython. I checked again and found it 3 times as slow on a column with 100 rows. Two possible explanations: 1) Regex replace operations, even vectorized, can be quite slow. 2) Apply is smart, e.g. if you apply a standard function such as `sum` it will be quite fast. I assume something similar may be going on with `join`. – IanS Dec 07 '17 at 10:03
  • @Shoof @IanS I edited the answer to add timings, and added a new method using `str.join`, which comes in 2nd place after `.apply(', '.join)`. – Anton Protopopov Dec 07 '17 at 12:13
  • @IanS thanks a lot for the timed results! That's indeed a bit surprising as some books suggested otherwise. Great to see the comparisons! – alys Dec 16 '17 at 04:41
  • It's interesting that `str.join` is slower (but then, only by a bit). Note that operations on objects cannot really be vectorized since an `object` column is just a column of pointers (here, to lists). – IanS May 28 '20 at 07:30

Pandas offers a method for this, Series.str.join.
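
A minimal sketch, with an assumed series for illustration; note that str.join propagates NaN (and, per the docs, a list containing non-str objects also produces NaN):

import pandas as pd
import numpy as np

s = pd.Series([['one', 'two', 'three'], np.nan])

s.str.join(', ')
# 0    one, two, three
# 1                NaN
# dtype: object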

AMC
  • `Series.str.join(', ')` works well for me. As stated by the documentation, "The lists containing object(s) of types other than str will produce a NaN." For my use case, this was preferable to an error, which is what I got when using `Series.apply(', '.join)`. – Kyle Jul 07 '23 at 15:24