0

I want to assign in a new columns called 'new_col' a csv like string of other columns'values.

Currently I do as follows :

df['new_col'] = (df['a'].map(str) + ',' + df['b'].map(str))

This works perfectly fine but i want it to be autonomous. I want to feed the function a list of columns, and let it do the string.

Of course I could loop through the list as follows :

lstColumns = ['a','b']
lstItems = []

for item in lstColumns:
    lstItems.append(df[item])
szChain = (',').join(lstItems)

But that's quite ugly, and I might get to use it on dataframes with more columns.

So is there any way to simplify this ?

  • Possible duplicate of [Merge multiple column values into one column in python pandas](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas) – Anshul Jindal May 24 '19 at 13:33

4 Answers4

0

You can use something like this :

df['new_col'] = df[df.columns].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
Anshul Jindal
  • 368
  • 2
  • 17
0

Apply a function row-wise (axis=1) to the dataframe. The function maps to a string and joins with ", "

cols = ["a", "b"]
df.apply(lambda x: ", ".join(map(str, x[cols])), axis=1)                                                                                        
Will
  • 1,532
  • 10
  • 22
0

You can use the version proposed by @Anshul Jindal, but there is also another alternative, which significantly differs in the output and you may find it useful if you have nans in your data.

import io

df = pd.DataFrame({'a': ['a', 'b', np.nan], 
                   'b': [np.nan, 'e', 'f'], 
                   'c': ['g', 'h', 'i'], 
                   'd': ['j', np.nan, 'l']})
cols = ['a', 'b' ,'d']

# another approach, using temporary text buffer
with io.StringIO() as output:
    df[cols].to_csv(output, sep=',', index=False, header=False)
    output.seek(0)
    df = df.assign(new_col=output.readlines())

df.new_col = df.new_col.str.strip()

# approach proposed earlier
df = df.assign(new_col_2 = df[cols].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
))

print(df)
     a    b  c    d new_col new_col_2
0    a  NaN  g    j    a,,j       a,j
1    b    e  h  NaN    b,e,       b,e
2  NaN    f  i    l    ,f,l       f,l

Plus quite surprising timing of the approaches:

import io
import timeit

df = pd.DataFrame({'a': ['a', 'b', np.nan], 
                   'b': [np.nan, 'e', 'f'], 
                   'c': ['g', 'h', 'i'], 
                   'd': ['j', np.nan, 'l']})
cols = ['a', 'b' ,'d']

def buffer_approach(df, cols_to_merge):
    with io.StringIO() as output:
        df[cols_to_merge].to_csv(output, sep=',', index=False, header=False)
        output.seek(0)
        df = df.assign(new_col=output.readlines())

    df.new_col = df.new_col.str.strip()
    return df

def pandas_approach(df, cols_to_merge):
    df = df.assign(new_col = df[cols_to_merge].apply(
        lambda x: ','.join(x.dropna().astype(str)),
        axis=1
    ))
    return df

print(timeit.repeat("buffer_approach(df, cols)", globals=globals(), repeat=5, number=1000))
print(timeit.repeat("pandas_approach(df, cols)", globals=globals(), repeat=5, number=1000))

[2.5745794447138906, 2.556944037321955, 2.5482078031636775, 2.2512022089213133, 2.0038619451224804]
[3.6452969149686396, 3.326099018100649, 3.5136850751005113, 3.9479835461825132, 3.4149401267059147]
Garrus990
  • 88
  • 7
-1

Maybe I didn't understand your question correctly, but if you have a lot of columns you could do this:

cols_a = ['a1', 'a2', 'a3']
cols_b = ['b1', 'b2', 'b3']
cols_res = ['res1', 'res2', 'res3']

df = pd.DataFrame({i:[i, i] for i in (cols_a+cols_b+ cols_res)})

print(df)
   a1  a2  a3  b1  b2  b3  res1  res2  res3
0  a1  a2  a3  b1  b2  b3  res1  res2  res3
1  a1  a2  a3  b1  b2  b3  res1  res2  res3

df[cols_res] = (df[cols_a].astype(str).values + ',' + df[cols_b].astype(str).values)

print(df)
   a1  a2  a3  b1  b2  b3   res1   res2   res3
0  a1  a2  a3  b1  b2  b3  a1,b1  a2,b2  a3,b3
1  a1  a2  a3  b1  b2  b3  a1,b1  a2,b2  a3,b3


AT_asks
  • 132
  • 4
  • Please, could you tell why? (Or give a link) Vectorize operations in pandas is much faster than apply(). – AT_asks May 24 '19 at 13:37
  • why are you duplicating values in the columns res1, res2, res3 ? – Anshul Jindal May 24 '19 at 13:42
  • As I understood this line 'But that's quite ugly, and I might get to use it on dataframes with more columns.' there is a possibility that there are several columns that needed to be joined pairwise without looping through them.(updated solution to make it clearer) – AT_asks May 24 '19 at 13:48
  • Sorry, I didn't explain what I meant the right way. The exemple has 2idifferent columns. But I might have to do that with dataframes that have way more different columns. – Gregoire Rouet May 24 '19 at 14:14