5

I have a specific series of datasets which come in the following general form:

import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(range(1000), 3), 't0':['a', 'b', 'c'], 't1':['d','e','f'], 't2':['g','h','i'], 't3':['i','j', 'k']})

The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30. My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):

df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3

(Screenshot of the resulting dataframe: each row gains a result column such as a d g i.)

So far so good. This code is simple, but it becomes clumsy and inflexible as soon as I receive another dataset where the number of tn columns goes up. This is where my question comes in:

Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:

df['result'] = ' '.join(df.iloc[:, 1:])

Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings: Concatenate row-wise across specific columns of dataframe

EmEs
  • Try this: http://stackoverflow.com/a/32529152/5276797 – IanS Sep 19 '16 at 11:19
  • Perfect! The solution suggested by Russ is flexible and simple. Thanks @IanS for pointing out that answer! – EmEs Sep 19 '16 at 11:39
  • For two columns, this answer [https://stackoverflow.com/a/36911306/3903767] on the same question linked by @IanS is substantially faster using .str.cat – Johannes May 16 '18 at 09:44
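
For reference, the two-column form mentioned in the last comment looks roughly like this (a sketch using the example's column names; it covers only the two-column case, not the general n-column case the question asks about):

df['result'] = df['t0'].str.cat(df['t1'], sep=' ')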

3 Answers

2

The key to operating on columns (Series) of strings en masse is the Series.str accessor.

I can think of two .str methods to do what you want.

str.cat()

The first is str.cat. You have to start from a Series, but you can pass a list of Series (in older pandas versions you can't pass a DataFrame directly) to concatenate, with an optional separator. Using your example:

column_names = df.columns[1:]  # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')

Or, in one line:

df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
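
As a side note (not part of the original answer, and assuming a reasonably recent pandas, roughly 0.23 or later): str.cat also accepts a DataFrame as the others argument, so the list comprehension can be dropped:

# Sketch for newer pandas, where str.cat accepts a DataFrame as `others`:
first, rest = df.columns[1], df.columns[2:]
df['result'] = df[first].str.cat(df[rest], sep=' ')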

str.join()

The second is the .str.join() method, which works like the standard Python str.join(), but it needs a column (Series) of iterables, for example a column of tuples. You can get that by applying tuple row-wise to a sub-dataframe of the columns you're interested in:

tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')

Or, in one line:

df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')

BTW, don't try the above with list instead of tuple. As of pandas 0.20.1, if the function passed to DataFrame.apply() returns a list and that list has the same number of entries as the columns of the original (sub-)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.
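
The exact behavior has shifted across pandas versions, so if in doubt, a quick check on your installation is the safest guide (a sketch; the comments describe typical behavior, not a guarantee for every version):

out_tuple = df[df.columns[1:]].apply(tuple, axis=1)
out_list = df[df.columns[1:]].apply(list, axis=1)
print(type(out_tuple))  # a Series of tuples, which .str.join(' ') can consume
print(type(out_list))   # a Series of lists on recent pandas, a DataFrame on some older versions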

LeoRochael
2

Other than using apply to concatenate the strings, you can also use agg to do so.

df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]: 
0    a d g i
1    b e h j
2    c f i k
dtype: object
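
One caveat that applies to the ' '.join-based approaches (assuming your real data, unlike the example, may contain missing values): ' '.join raises a TypeError on NaN, so filling missing values first is one option:

# Sketch: fill NaN with empty strings before joining (assumes empty strings are acceptable;
# note that each filled cell leaves an extra space in the result)
df['result'] = df[df.columns[1:]].fillna('').agg(' '.join, axis=1)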
1

Here is a slightly different solution:

In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()

In [58]: df
Out[58]:
     n t0 t1 t2 t3   result
0   92  a  d  g  i  a d g i
1  916  b  e  h  j  b e h j
2  363  c  f  i  k  c f i k
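
For readability, the same chain can be unpacked into intermediate steps (a sketch; it assumes the 't' prefix in the regex matches only the text columns):

tcols = df.filter(regex=r'^t')              # select the t0..tn columns by name pattern
padded = tcols.apply(lambda x: x.add(' '))  # append a trailing space to every value
summed = padded.sum(axis=1)                 # summing strings row-wise concatenates them
df['result'] = summed.str.strip()           # drop the trailing space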
MaxU - stand with Ukraine