19

I am trying to concatenate multiple pandas DataFrame columns, with a different token between each pair of columns.

For example, my dataset looks like this:

import pandas as pd

dataframe = pd.DataFrame({'col_1' : ['aaa','bbb','ccc','ddd'], 
                          'col_2' : ['name_aaa','name_bbb','name_ccc','name_ddd'], 
                          'col_3' : ['job_aaa','job_bbb','job_ccc','job_ddd']})

I want to output something like this:

    features
0   aaa <0> name_aaa <1> job_aaa
1   bbb <0> name_bbb <1> job_bbb
2   ccc <0> name_ccc <1> job_ccc
3   ddd <0> name_ddd <1> job_ddd

Explanation:

Join the columns with "<{}>" separators, where {} is an increasing number starting at 0.

What I've tried so far:

I don't want to modify the original DataFrame, so I created two new DataFrames:

features_df = pd.DataFrame()
final_df    = pd.DataFrame()
for iters in range(len(dataframe.columns)):
    features_df[dataframe.columns[iters]] = dataframe[dataframe.columns[iters]] + ' ' + "<{}>".format(iters)
final_df['features'] = features_df[features_df.columns].agg(' '.join, axis=1)

There is an issue: this adds <2> after the last column, but I want the output shown above. Also, this is not the idiomatic pandas way to do this task. How can I make it more efficient?

Georgy
Aaditya Ura

4 Answers

8
from itertools import chain

dataframe['features'] = dataframe.apply(lambda x: ''.join([*chain.from_iterable((v, f' <{i}> ') for i, v in enumerate(x))][:-1]), axis=1)

print(dataframe)

Prints:

  col_1     col_2    col_3                      features
0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd
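
To make the one-liner easier to follow, here is a sketch of the same idea split into steps. The helper name `interleave_tokens` is made up for illustration and is not part of pandas or itertools. The result is kept in a separate Series so the original frame is left untouched, as the question asks.

```python
from itertools import chain

import pandas as pd


def interleave_tokens(values):
    """Join values with ' <0> ', ' <1> ', ... between neighbours (hypothetical helper)."""
    # Pair every value with the token that would follow it:
    # ('aaa', ' <0> '), ('name_aaa', ' <1> '), ('job_aaa', ' <2> ')
    pairs = ((str(v), f" <{i}> ") for i, v in enumerate(values))
    # Flatten the pairs into one list, then drop the trailing token.
    flat = list(chain.from_iterable(pairs))[:-1]
    return "".join(flat)


dataframe = pd.DataFrame({
    "col_1": ["aaa", "bbb"],
    "col_2": ["name_aaa", "name_bbb"],
    "col_3": ["job_aaa", "job_bbb"],
})

# axis=1 hands each row to the helper; the original frame is not modified.
features = dataframe.apply(interleave_tokens, axis=1)
print(features.tolist())
```

Dropping the last element with `[:-1]` is what removes the unwanted trailing `<2>`.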
Andrej Kesely
8

You can use df.agg to join the columns of the dataframe by passing the optional parameter axis=1. Use:

df['features'] = df.agg(
    lambda s: r' <{}> '.join(s).format(*range(s.size)), axis=1)

Output:

# print(df)
  col_1     col_2    col_3                      features
0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd
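
The lambda first joins each row with the literal placeholder ` <{}> `, giving a template such as `aaa <{}> name_aaa <{}> job_aaa`, and `str.format` then fills the placeholders with 0, 1, and so on. One caveat: if a cell itself contains `{` or `}`, `str.format` reads it as a replacement field, so a value like `b{b}b` would typically fail. Below is a sketch of a variant that avoids `str.format`; `join_with_counters` is a hypothetical name, and the brace-containing row is invented test data.

```python
import pandas as pd


def join_with_counters(row):
    """Hypothetical helper: join a row with ' <i> ' separators without str.format."""
    values = [str(v) for v in row]
    if not values:
        return ""
    # Prefix every value after the first with its separator, e.g. ' <0> name_aaa'.
    tail = [f" <{i}> {v}" for i, v in enumerate(values[1:])]
    return values[0] + "".join(tail)


df = pd.DataFrame({
    "col_1": ["aaa", "b{b}b"],  # second row contains literal braces (made-up data)
    "col_2": ["name_aaa", "name_bbb"],
    "col_3": ["job_aaa", "job_bbb"],
})

# Same df.agg(..., axis=1) call as above, with the helper in place of the lambda.
features = df.agg(join_with_counters, axis=1)
print(features.tolist())
```

If the data is known to be free of braces, the shorter `join` plus `format` version above does the same job.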
Shubham Sharma
  • 2
    That's a clever solution. – Andrej Kesely May 24 '20 at 08:41
  • 1
    @ShubhamSharma Since `s` is a Series, instead of `len(s)` you could use `s.size` (or `s.values.size`), which should be faster. Nice answer, +1 ;) [`df.apply over axis 1`](https://stackoverflow.com/a/54433552/12416453) is not encouraged, so I guess `df.agg` is the way. – Ch3steR May 24 '20 at 09:13
  • Thanks @Ch3steR! I don't know if there is any benefit to using `s.size` instead of `len(s)`, but according to this [post](https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe), `len(s.index)` and `s.size` seem to be about the same in terms of speed. Thanks for the suggestion anyway. – Shubham Sharma May 24 '20 at 09:27
3
def join_(value):
    vals = []
    for i, j in enumerate(value):
        vals.append(j + " <%d>" % i if i < len(value) - 1 else j)
    return " ".join(vals)

# axis=1 passes each row (as a Series) to join_.
dataframe['features'] = dataframe.apply(join_, axis=1)

print(dataframe)

Output:

  col_1     col_2    col_3                      features
0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd
sushanth
3
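# Append " <num>" to every entry except the last one in each row.
# Comparing positions (not values) keeps this correct even when an
# earlier entry happens to equal the last entry.
df['features'] = [" ".join(f"{entry} <{num}>"
                           if num < len(ent) - 1
                           else entry
                           for num, entry in enumerate(ent))
                  for ent in df.to_numpy()]

Output:

  col_1     col_2    col_3                      features
0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd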
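
The approaches so far all work row by row. As an alternative sketch, the same string can be built column by column with plain Series concatenation, so the Python-level loop runs once per column rather than once per row. This may scale better on tall frames, but it has not been benchmarked here. The names `features` and `final_df` are illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "col_1": ["aaa", "bbb", "ccc", "ddd"],
    "col_2": ["name_aaa", "name_bbb", "name_ccc", "name_ddd"],
    "col_3": ["job_aaa", "job_bbb", "job_ccc", "job_ddd"],
})

cols = list(dataframe.columns)

# Start from the first column, then append ' <i> ' plus the next column.
# The token is only added between columns, so nothing trails the last one.
features = dataframe[cols[0]].astype(str)
for i, col in enumerate(cols[1:]):
    features = features + f" <{i}> " + dataframe[col].astype(str)

# Put the result in a new frame; the original DataFrame is left unchanged.
final_df = features.to_frame("features")
print(final_df)
```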
sammywemmy