How to concatenate multiple column values into a single column in Pandas dataframe

Question

This question is similar to one posted earlier, but I want to concatenate three columns instead of two.

Here is how two columns are combined:

from pandas import DataFrame

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)

df
    bar foo new combined
0   1   a   apple   a_1
1   2   b   banana  b_2
2   3   c   pear    c_3

I want to combine three columns with this command but it is not working, any idea?

df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

if you want to concat 3 columns you need 3 %s. (**%s_%s_%s**) like `df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)` — user2652620, Nov 09 '17 at 14:33
Possible duplicate of [String concatenation of two pandas columns](https://stackoverflow.com/questions/11858472/string-concatenation-of-two-pandas-columns) — MrFun, Mar 18 '19 at 03:10
A more comprehensive answer showing timings for multiple approaches is [Combine two columns of text in pandas dataframe](https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-pandas-dataframe) — smci, Mar 13 '21 at 04:16
Your reference post later has `df.astype(str).agg('_'.join, axis=1)`. — Ynjxsjmh, Apr 20 '22 at 07:33

score 176 · Answer 1 · answered Sep 11 '18 at 06:53

Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:

cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

answered Sep 11 '18 at 06:53 by Allen

This is the best solution when the column list is saved as a variable and can hold a different amount of columns every time – M_Idk392845 Nov 26 '20 at 22:52
Tiny gotcha I ran into was that `.values.astype(str)` converts `None` into the string `'None'` rather than an empty string. Apparently. – grofte Apr 13 '21 at 11:21
**without lambda** (faster and more concise): `df[cols].astype(str).apply('_'.join, axis=1)`. That said, using `.str.cat(...).str.cat(...)...` is faster still. – Pierre D Oct 02 '21 at 20:08
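
As a rough illustration of the variants discussed in this answer and its comments, the sketch below rebuilds the question's three-column frame and checks that the lambda version and the lambda-free version produce the same strings. The frame contents come from the question; the names `with_lambda` and `without_lambda` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
cols = ['foo', 'bar', 'new']

# Row-wise join with a lambda, as in the answer.
with_lambda = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

# Lambda-free variant from the comments: cast once, then join each row.
without_lambda = df[cols].astype(str).apply('_'.join, axis=1)

print(with_lambda.tolist())
print(without_lambda.tolist())
```

Because `cols` is an ordinary list, adding or removing a column only means changing that list.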

score 110 · Answer 2 · edited Jul 19 '22 at 16:40

You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.

In [17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']

In [18]: df
Out[18]: 
   bar foo     new    combined
0    1   a   apple   1_a_apple
1    2   b  banana  2_b_banana
2    3   c    pear    3_c_pear

edited Jul 19 '22 at 16:40 by Bill the Lizard

answered Sep 02 '16 at 11:43 by shivsn

this solution will be much faster compared to the `.apply(, axis=1)` one on bigger DFs – MaxU - stand with Ukraine Sep 02 '16 at 11:58
@MaxU yeah, and it's very easy. – shivsn Sep 02 '16 at 12:27
I've added a [comparison](http://stackoverflow.com/a/39293567/5741205) against 30K rows DF... – MaxU - stand with Ukraine Sep 02 '16 at 13:25
I'm getting a `SettingWithCopyWarning` when I use this solution - how could I avoid triggering that warning? – Nate Jul 13 '18 at 19:00
This gets annoying when you need to join many columns, however. – derchambers Oct 11 '18 at 22:42
If any of the values is `None`, the corresponding entry of `df['combined']` becomes `nan`. Example: if `df.new.iloc[0]` is `None`, then `df.combined.iloc[0]` becomes `nan` instead of `1_a_`. – Avantika Banerjee Jul 17 '19 at 11:16
What if I want to combine the columns based on a condition, such as `df['bar'] == 1`? – wawawa Jul 28 '20 at 21:50
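
The missing-value behaviour mentioned in the comments can be reproduced with a small sketch. The two-row frame below is made up for illustration, and filling with an empty string is only one possible workaround, not something the answer itself prescribes.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b'],
                   'bar': [1, 2],
                   'new': [None, 'banana']})

# A missing value in any operand makes the whole concatenated result missing.
plain = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']

# Replacing missing values with '' first keeps the rest of the string.
filled = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new'].fillna('')

print(plain.isna().tolist())
print(filled.tolist())
```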

score 29 · Answer 3 · answered May 24 '18 at 08:39

If you have even more columns you want to combine, using the Series method str.cat might be handy:

df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")

Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).

answered May 24 '18 at 08:39 by cbrnr

Clever, but this caused a huge memory error for me. Tedious as it may be, writing `df[col].map(str) + '_' + df[col2].map(str) + ... + df[col9].map(str)` is way more efficient. – Corey Levinson Sep 09 '19 at 01:23
It's interesting! I didn't know we can use DataFrame as an argument in `Series.str.cat()` – MaxU - stand with Ukraine Nov 15 '19 at 09:14
This is by far the easiest for me, and I like the sep parameter – avirr Feb 24 '20 at 19:58
No memory issues for me. Had to add `df["foo"].fillna('')`. – citynorman Mar 06 '22 at 06:24
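
As a sketch of how missing values interact with `str.cat`, the example below uses a made-up frame in which `foo` has one missing entry. It relies on the `na_rep` parameter of `Series.str.cat`; choosing an empty string as the replacement is an assumption about what the desired output looks like.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', None, 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

others = df[['bar', 'new']].astype(str)

# Default: a missing value in the calling column gives a missing result.
default = df['foo'].str.cat(others, sep='_')

# na_rep substitutes a string for missing values before joining.
with_na_rep = df['foo'].str.cat(others, sep='_', na_rep='')

print(default.isna().tolist())
print(with_na_rep.tolist())
```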

score 18 · Answer 4 · answered Sep 02 '16 at 13:24

Just wanted to make a time comparison for both solutions (for 30K rows DF):

In [1]: df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

In [2]: big = pd.concat([df] * 10**4, ignore_index=True)

In [3]: big.shape
Out[3]: (30000, 3)

In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop

In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop

a few more options:

In [6]: %timeit big.iloc[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop

In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop

Very nice with additional options. – shivsn Sep 02 '16 at 13:28
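
Outside IPython, a comparison like the one above could be sketched with the standard-library `timeit` module. The helper names `via_apply` and `via_concat` are hypothetical, and the frame is kept at 3,000 rows (rather than the 30,000 used in the answer) purely to keep the run short. Absolute timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
# 3,000 rows here (an assumption to keep the run short); the answer used 30,000.
big = pd.concat([df] * 10**3, ignore_index=True)


def via_apply(frame):
    # Row-wise formatting, as in the question.
    return frame.apply(lambda x: '%s_%s_%s' % (x['bar'], x['foo'], x['new']), axis=1)


def via_concat(frame):
    # Vectorized string concatenation.
    return frame['bar'].astype(str) + '_' + frame['foo'] + '_' + frame['new']


# Both approaches should build identical strings.
same = via_apply(big).tolist() == via_concat(big).tolist()

# Time a few repetitions of each approach.
t_apply = timeit.timeit(lambda: via_apply(big), number=3)
t_concat = timeit.timeit(lambda: via_concat(big), number=3)
print(same, t_apply, t_concat)
```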

score 12 · Answer 5 · answered Jun 01 '20 at 15:42

Possibly the fastest solution is to operate in plain Python:

pd.Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)

Comparison against @MaxU's answer (using the big data frame, which has both numeric and string columns):

%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit pd.Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparison against @derchambers' answer (using their df data frame, where all columns are strings):

from functools import reduce

def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])

def list_map(df, columns):
    return pd.Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )

%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
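
For reference, the plain-Python approach from this answer can be tried on the question's small frame as follows. This is a minimal sketch: the explicit `cols` list is added here so that the result does not depend on column order, and the cast through NumPy follows the commented-out variant for non-string columns.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
cols = ['foo', 'bar', 'new']

# Cast the selected block to strings in NumPy, then join each row in plain Python.
rows = df[cols].values.astype(str).tolist()
combined = pd.Series(map('_'.join, rows), index=df.index)

print(combined.tolist())
```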

score 9 · Answer 6 · edited Apr 17 '20 at 20:45

The answer given by @Allen is reasonably generic but can be slow for larger dataframes.

Reduce does a lot better:

from functools import reduce

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'


def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])


def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)

# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)

# profile
%timeit df1 = reduce_join(df, list('1234'))  # 733 ms
%timeit df2 = apply_join(df, list('1234'))   # 8.84 s

edited Apr 17 '20 at 20:45

answered Apr 17 '20 at 20:32 by derchambers

Is there a way to skip empty cells so that no separator is added for them? For example, if the strings to join are "", "a" and "b", the result is "_a_b"; is it possible to get "a_b" instead? I couldn't find an efficient way to do this, because it requires a row-wise operation, since the number of non-empty values differs per row. – Yang Jun 12 '20 at 13:01
I am not sure what you mean @Yang, maybe post a new question with a workable example? – derchambers Jun 12 '20 at 22:12
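
One possible reading of the question in the comment above is "join only the non-empty strings of each row". Under that assumption, and assuming every value is already a string, a row-wise sketch could look like this. The frame is made up for illustration, and this is not claimed to be the fastest option.

```python
import pandas as pd

df = pd.DataFrame({'1': ['', 'x'],
                   '2': ['a', ''],
                   '3': ['b', 'y']})
cols = ['1', '2', '3']

# Join only the non-empty strings of each row, so no separator is emitted for blanks.
combined = pd.Series(
    ['_'.join(s for s in row if s) for row in df[cols].values.tolist()],
    index=df.index,
)
print(combined.tolist())
```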

score 8 · Answer 7 · answered Sep 02 '16 at 11:43

I think you are missing one %s:

df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

answered Sep 02 '16 at 11:43 by milos.ai

score 7 · Answer 8 · answered Feb 28 '22 at 08:32

First convert the columns to str. Then use .T.agg('_'.join) to concatenate them.

# Initialize columns
cols_concat = ['first_name', 'second_name']

# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')

# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)
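
A small runnable sketch of the steps above, using made-up names as data (only the column names come from the answer). It also shows what appears to be an equivalent spelling that passes `axis=1` to `agg` instead of transposing.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer.
df = pd.DataFrame({'first_name': ['Ada', 'Alan'],
                   'second_name': ['Lovelace', 'Turing']})
cols_concat = ['first_name', 'second_name']

# Transpose so that agg joins what were originally the rows.
via_transpose = df[cols_concat].astype(str).T.agg('_'.join)

# Same join without the transpose, asking agg to work along rows.
via_axis = df[cols_concat].astype(str).agg('_'.join, axis=1)

print(via_transpose.tolist())
print(via_axis.tolist())
```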

score 2 · Answer 9 · edited Apr 19 '18 at 07:59

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)

If you concatenate with a string such as '_', first convert each column you want to join to string; after that you can concatenate the columns.

edited Apr 19 '18 at 07:59

answered Apr 18 '18 at 10:10 by Manivannan Murugavel

score 2 · Answer 10 · edited Oct 12 '18 at 13:10

df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']

Here 'X' stands for any delimiter (e.g. a space) with which you want to separate the two merged columns.

edited Oct 12 '18 at 13:10 by Papershine

answered Oct 12 '18 at 13:06 by Nipun Kumar Goel

score 2 · Answer 11 · edited Apr 26 '21 at 07:46

@derchambers I found one more solution:

import pandas as pd

# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'

def eval_join(df, columns):

    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)

    return eval(to_eval)


#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms

edited Apr 26 '21 at 07:46

answered Apr 22 '20 at 12:44 by Grzegorz

score 2 · Answer 12 · answered Nov 27 '20 at 14:53

If you have a list of columns you want to concatenate, and maybe you'd like to use some separator, here's what you can do:

def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)

This should be faster than apply and takes an arbitrary number of columns to concatenate.
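
A usage sketch for the function above, repeated here so the example runs on its own. Note that, as written, the function modifies the frame in place and returns nothing; the sample frame is the one from the question.

```python
import pandas as pd


def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    # Same logic as the answer's function: build the new column in place.
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)


df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# The function modifies df and returns None, so call it for its side effect.
result = concat_columns(df, ['foo', 'bar', 'new'], 'combined', sep='_')
print(result)
print(df['combined'].tolist())
```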

score 2 · Answer 13 · answered Dec 02 '21 at 13:03

You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):

def concat_cols(df, cols_to_concat, new_col_name, separator):  
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df

Sample usage:

test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')

score 0 · Answer 14 · answered Sep 02 '22 at 16:25

Following up on @Allen's response: if you need to chain this operation with other dataframe transformations, use assign:

df.assign(
    combined = lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
    )
)

answered Sep 02 '22 at 16:25 by Antiez
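
To make the chaining point concrete, here is a minimal sketch in which a row filter and the combined column are applied in one expression. The filter on `bar` is an arbitrary example step, and the join uses the lambda-free `astype(str).agg('_'.join, axis=1)` spelling rather than the answer's `apply`.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})
cols = ['foo', 'bar', 'new']

# Filter first, then add the combined column in the same chain.
# The lambda passed to assign receives the already filtered frame.
result = (
    df.loc[lambda x: x['bar'] > 1]
      .assign(combined=lambda x: x[cols].astype(str).agg('_'.join, axis=1))
)
print(result['combined'].tolist())
```

The original `df` is left unchanged, since `assign` returns a new frame.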

Gonçalo Peres · Answer 15 · 2022-09-20T09:49:26.220

Since three columns are being combined, one needs three format specifiers, '%s_%s_%s', not just two, '%s_%s'. The following will do the job:

df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.

columns = ['foo', 'bar', 'new']

df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)

[Out]:
  foo  bar     new    combined
0   a    1   apple   a_1_apple
1   b    2  banana  b_2_banana
2   c    3    pear    c_3_pear

This last one is more convenient, as one can simply change or add column names in the list, so it requires fewer changes.
