2

I have a dask dataframe as below:

   Column1  Column2  Column3  Column4  Column5
 0    a        1        2        3        4
 1    a        3        4        5
 2    b        6        7        8
 3    c        7        7        

I want to merge all of the columns into a single one efficiently. And I want each row to be a single string. Like below:

   Merged_Column
 0    a,1,2,3,4   
 1    a,3,4,5  
 2    b,6,7,8  
 3    c,7,7,7        

I've seen this question but it doesn't seem efficient since it is using the apply function. How can I achieve this as efficient as possible? (Speed + memory usage) Or is apply isn't as problematic as I believe since this is dask, not pandas.

This is what I tried. It seems like it is working but I am worried about the speed of it with the big dataframe.

cols= df.columns
df['combined'] = df[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1, meta=('str'))
df = df.drop(cols, axis=1)

I also need to get rid of the column header.

rpanai
  • 12,515
  • 2
  • 42
  • 64
MehmedB
  • 1,059
  • 1
  • 16
  • 42

2 Answers2

3

When you have to join string @saravanan saminathan methods wins hands down. Here there are some timing with dask

import dask.dataframe as dd
import numpy as np
import pandas as pd

N = int(1e6)

df = pd.DataFrame(np.random.randint(0,100,[N,10]))
df = dd.from_pandas(df, npartitions=4)
df = df.astype("str")
df_bk = df.copy()

Apply

%%time
df["comb"] = df.apply(lambda x:",".join(x), axis=1,meta=("str"))
df = df.compute()

CPU times: user 44.4 s, sys: 925 ms, total: 45.3 s
Wall time: 44.6 s

Add (explicit)

df = df_bk.copy()

%%time
df["comb"] = df[0]+","+df[1]+","+df[2]+","+df[3]+","+df[4]+","+\
             df[5]+","+df[6]+","+df[7]+","+df[8]+","+df[9]

df = df.compute()

CPU times: user 8.95 s, sys: 860 ms, total: 9.81 s
Wall time: 9.56 s

Add (loop)

In case you have many columns and you don't want to write down all of them

df = df_bk.copy()

%%time
df["comb"] = ''
for col in df.columns:
    df["comb"]+=df[col]+","

df = df.compute()

CPU times: user 11.6 s, sys: 1.32 s, total: 12.9 s
Wall time: 12.3 s
rpanai
  • 12,515
  • 2
  • 42
  • 64
  • The code snippet under the title 'Add(loop)' is returning 'nan's in the df["comb"] column. Are you sure this is working? – MehmedB Dec 27 '19 at 16:33
  • I just checked what is inside of df right before the loop and it contains values other than nans. – MehmedB Dec 27 '19 at 16:37
2

Lambda function will take each row and so it may affect the speed of the process

So If we apply the operation by columns wise it will be fast

df = df.astype(str)
df["Column1"]+","+df["Column2"]+","+df["Column3"]+","+df["Column4"]+","+df["Column5"]
rpanai
  • 12,515
  • 2
  • 42
  • 64