
I am wondering whether I could write a function like this in pandas:

    def concatenate(df, columnlist, newcolumn):
        # df is the dataframe
        # columnlist is the list of names of the columns I want to concatenate
        # newcolumn is the name of the resulting new column

        for c in columnlist:
            ...  # some pandas functions

        return df  # now contains the concatenated "newcolumn"

I am asking because len(columnlist) is going to be very large and will change from run to run. Thanks!

LarryZ
  • `I am wondering if...` - You will never know until you try. – wwii Nov 25 '17 at 00:59
  • Does this answer your question? [Combine two columns of text in dataframe in pandas/python](https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python) – jmuhlenkamp Dec 13 '19 at 21:57

2 Answers


Try this:

    import numpy as np
    np.add.reduce(df[columnlist], axis=1)

This "adds" the values in each row, which for strings means concatenating them ("abc" + "de" == "abcde"). The result has one value per row, so it can be assigned to a new column of df.
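As a rough sketch, the one-liner above could be wrapped into the function shape asked for in the question. The sample data and the column name "joined" are made up for illustration. Casting with astype(str) and converting to a plain object array with to_numpy are precautions added here, not part of the original one-liner.

```python
import numpy as np
import pandas as pd


def concatenate(df, columnlist, newcolumn):
    """Return a copy of df with the columns in columnlist joined row-wise."""
    out = df.copy()
    # Cast to str so numeric columns can be "added" to text columns,
    # then work on a plain object array so "+" is applied element by element.
    values = out[columnlist].astype(str).to_numpy(dtype=object)
    out[newcolumn] = np.add.reduce(values, axis=1)
    return out


# Hypothetical sample data
df = pd.DataFrame({"A": ["ABC", "DEF"], "B": ["XYZ", "UVW"], "C": [1, 2]})
result = concatenate(df, ["A", "B", "C"], "joined")
print(result["joined"].tolist())  # ['ABCXYZ1', 'DEFUVW2']
```

Because the function works on a copy, the caller's dataframe is left unchanged. Drop the copy if modifying it in place is preferred.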


Originally I thought you wanted to concatenate them lengthwise, into a single longer series of all the values. If anyone else wants to do that, here's the code:

    pd.concat(map(df.get, columnlist)).reset_index(drop=True)
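To make the difference from row-wise concatenation concrete, here is a small sketch of the lengthwise variant on invented data. It uses a list comprehension instead of map(df.get, ...), which is only a stylistic choice.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"A": ["a1", "a2"], "B": ["b1", "b2"]})
columnlist = ["A", "B"]

# Stack the selected columns end to end into one longer Series.
stacked = pd.concat([df[c] for c in columnlist]).reset_index(drop=True)
print(stacked.tolist())  # ['a1', 'a2', 'b1', 'b2']
```

The result has len(df) * len(columnlist) entries, so it cannot be assigned back as a column of df.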
John Zwinck
  • Thanks John! I guess you misunderstood my original request @John Zwinck: if Column A is "ABC" and Column B is "XYZ" my newcolumn should be "ABCXYZ". The newcolumn has the exact length of the dataframe. – LarryZ Nov 25 '17 at 01:17
  • @LarryZ: I see. I've changed my answer. – John Zwinck Nov 25 '17 at 01:51
  • Thanks, @John Zwinck. It worked! It seems this method requires all the columns to be str, when any column contains int or float it will give the following error: " TypeError: must be str, not float " – LarryZ Nov 27 '17 at 17:37
  • @LarryZ: You can fix that by `np.add.reduce(df[columnlist].astype(str), axis=1)`. – John Zwinck Nov 28 '17 at 02:59
  • Thanks, man! This is the answer, period! A shameless followup question: What if I also want to add a "separator" between columns? i.e. instead of "ABCXYZ" I want "ABC XYZ"? A dumb way is to add a new column called "Space" that contains nothing but one space " ", then insert the column name "Space" into my columnlist where necessary. It worked fine. Is there a more Pythonic way to do this? – LarryZ Nov 28 '17 at 17:09
  • @LarryZ: I think that's a fine solution (using a Series with the separator). – John Zwinck Nov 29 '17 at 00:48
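As an alternative to the helper "Space" column discussed in the comments above, one possible sketch is to join each row with str.join. This is a different technique from np.add.reduce, and the sample data and the name sep are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data with one numeric column
df = pd.DataFrame({"A": ["ABC", "DEF"], "B": ["XYZ", "UVW"], "N": [1, 2]})
columnlist = ["A", "B", "N"]
sep = " "

# Convert every selected column to str, then join the values of each row.
joined = df[columnlist].astype(str).apply(sep.join, axis=1)
print(joined.tolist())  # ['ABC XYZ 1', 'DEF UVW 2']
```

Row-wise apply runs a Python call per row, so on very large frames it may be slower than the vectorised reduction. It does avoid adding throwaway columns to the dataframe.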

Given a dataframe like this:

    df

         A    B
    0  aaa  ddd
    1  bbb  eee
    2  ccc  fff

If every column is a string column, you can just use df.sum along axis 1:

    df.sum(1)

    0    aaaddd
    1    bbbeee
    2    cccfff
    dtype: object

If you need to perform a conversion, you can do so:

    df.astype(str).sum(1)

If you need to select a subset of your data (only string columns?), you can use select_dtypes:

    df.select_dtypes(include=['str']).sum(1)

If you need to select by columns, this should do:

    df[['A', 'B']].sum(1)

In every case, the operation returns a new object rather than modifying df in place, so if you want to keep the result, assign it:

    r = df.sum(1)
cs95
  • Thanks, @COLDSPEED. Your solution appears promising. I tried "df.select_dtypes(include=['str']).sum(1)" but get this error below: File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2369, in select_dtypes invalidate_string_dtypes(dtypes) File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 497, in invalidate_string_dtypes raise TypeError("string dtypes are not allowed, use 'object' instead") TypeError: string dtypes are not allowed, use 'object' instead – LarryZ Nov 27 '17 at 17:50
  • Then when I change the code to df.select_dtypes(include=['object']).sum(1), it gave no error but the result is one column with all "0". Any idea why? Thanks! – LarryZ Nov 27 '17 at 17:52
  • @LarryZ what are your column types initially? – cs95 Nov 27 '17 at 19:48
  • @COLDSPEED Thanks for the followup. A number of the columns contains mixed data type, both str and int. These columns are labeled as "object" – LarryZ Nov 28 '17 at 17:23
  • @LarryZ `select_dtypes` may not work, but everything else should. – cs95 Nov 28 '17 at 18:18
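For the mixed str/int object columns described in the comments above, one possible approach is to find those columns and cast everything to str before joining. This is a sketch under assumptions: the data is invented, the name mixed is hypothetical, and it does not try to reproduce the all-"0" result reported above.

```python
import pandas as pd

# Hypothetical data: both columns hold a mix of str and numbers,
# so pandas labels them with the generic "object" dtype.
df = pd.DataFrame({"A": ["aaa", 1, "ccc"], "B": ["ddd", "eee", 2.5]})

# Columns whose values have more than one Python type.
mixed = [c for c in df.columns if df[c].map(type).nunique() > 1]
print(mixed)  # ['A', 'B']

# Cast every value to str first, then join the values of each row.
joined = df.astype(str).apply("".join, axis=1)
print(joined.tolist())  # ['aaaddd', '1eee', 'ccc2.5']
```

Checking the per-value types first shows which columns need the cast. The astype(str) call itself is safe to apply to columns that are already all strings.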