25

I have multiple pandas dataframe which may have different number of columns and the number of these columns typically vary from 50 to 100. I need to create a final column that is simply all the columns concatenated. Basically the string in the first row of the column should be the sum(concatenation) of the strings on the first row of all the columns. I wrote the loop below but I feel there might be a better more efficient way to do this. Any ideas on how to do this

num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]
Clock Slave
  • 7,627
  • 15
  • 68
  • 109

6 Answers6

43

Solution with sum, but output is float, so convert to int and str is necessary:

df['new'] = df.sum(axis=1).astype(int).astype(str)

Another solution with apply function join, but it the slowiest:

df['new'] = df.apply(''.join, axis=1)

Last very fast numpy solution - convert to numpy array and then 'sum':

df['new'] = df.values.sum(axis=1)

Timings:

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)

cols = list('ABC')

#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop

In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop

In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop

In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop

EDIT If dtypes of some columns are not object (obviously strings) cast by DataFrame.astype:

df['new'] = df.astype(str).values.sum(axis=1)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Is there a way to only concatenate the last 20 columns but only concatenate a particular column if there is data? Also, i would want to have a delimiter in place so when you're looking at your overall column, you can see how it's broken out. – Gary Dorman Jun 23 '19 at 15:43
24
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Gives us:

df
Out[6]: 
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

To select a given set of columns:

df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')

df
Out[8]: 
   A  B  C concat
0  1  4  7     14
1  2  5  8     25
2  3  6  9     36

However, I've noticed that approach can sometimes result in NaNs being populated where they shouldn't, so here's another way:

>>> from functools import reduce
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

Although it should be noted that this approach is a lot slower:

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop

Versus

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop
blacksite
  • 12,086
  • 10
  • 64
  • 109
15

I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.

For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Here is my modification for Python 3.6.3:

df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
bodily11
  • 584
  • 6
  • 10
3

The solutions given above that use numpy arrays have worked great for me.

However, one thing to be careful about is the indexing when you get the numpy.ndarray from df.values, since the axis labels are removed from df.values.

So to take one of the solutions offered above (the one that I use most often) as an example:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

This portion:

df.fillna('').values

does not preserve the indices of the original DataFrame. Not a problem when the DataFrame has the common 0, 1, 2, ... row indexing scheme, but this solution will not work when the DataFrame is indexed in any other way. You can fix this by adding an index= argument to pd.Series():

df['concat'] = pd.Series(df.fillna('').values.tolist(), 
                         index=df.index).str.join('')

I always add the index= argument just to be safe, even when I'm sure the DataFrame is row-indexed as 0, 1, 2, ...

1

This lambda approach offers some flexibility with columns chosen and separator type:

Setup:

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})

    A   B   C
0   1   4   7
1   2   5   8
2   3   6   9

Concatenate All Columns - no separator:

cols = ['A', 'B', 'C']
df['combined'] = df[cols].apply(lambda row: ''.join(row.values.astype(str)), axis=1)

    A   B   C   combined
0   1   4   7   147
1   2   5   8   258
2   3   6   9   369

Concatenate Two Columns A and C with '_' separator:

cols = ['A', 'C']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

    A   B   C   combined
0   1   4   7   1_7
1   2   5   8   2_8
2   3   6   9   3_9
Grant Shannon
  • 4,709
  • 1
  • 46
  • 36
0

as a solution to @Gary Dorman's question in the comment,
i would want to have a delimiter in place so when you're looking at your overall column, you can see how it's broken out.

you maybe use

df_tmp=df.astype(str) + ','
df_tmp.sum(axis=1).str.rstrip(',')

before:

1.2.3.480tcp
6.6.6.680udp
7.7.7.78080tcp
8.8.8.88080tcp
9.9.9.98080tcp

after:

1.2.3.4,80,tcp
6.6.6.6,80,udp
7.7.7.7,8080,tcp
8.8.8.8,8080,tcp
9.9.9.9,8080,tcp

which looks better (like CSV :) This additional sep step is about 30% slower in my machine.

Osadhi Virochana
  • 1,294
  • 2
  • 11
  • 21
ap_
  • 1
  • 5