Concatenate pandas string columns with separator for large dataframe

Question

I have a large dataframe (~1 million rows) with 20 string columns that I'm trying to concatenate into a single column with a separator, dropping NA values on the way. (Each row has a variable number of valid entries and NA values.)

Based on the solution here, I can get the output I need using df.apply but it is very slow:

raw['combined'] = raw.loc[:, 'record_1':'record_20'].apply(lambda x: '|'.join(x.dropna().values), axis=1)

Is there a faster way to do this concatenation or am I stuck with df.apply?

It is very difficult to answer your question without seeing both the data and the code produces your problem. Please read about how to ask a good question and try to post a [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example "Minimal Reproducible Example") so we can better help you. — itprorh66, Jun 18 '22 at 14:43
Check out: [Concatenate column values in a pandas DataFrame while ignoring NaNs](https://stackoverflow.com/questions/54468506/concatenate-column-values-in-a-pandas-dataframe-while-ignoring-nans) in particular solution by rafaelc. — DarrylG, Jun 18 '22 at 14:50
What format are we talking about? CSV? Can the delimiter be part of the content? Is there anything else in the file except these 20 columns? I think if it is CSV, then the best way would be to not load it as DataFrame but work in the file text per line directly. Something like `with open('filename') as f: lines = [line.rstrip() for line in f]`, merge that in the required way by replacing the delimiter with `|` and write to an output file directly, so you don't keep anything in memory. — ewz93, Jun 18 '22 at 15:20

Soudipta Dutta · Answer 1 · 2023-02-05T12:47:16.997

agg is much faster than apply. Use agg instead.

import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime


df = pd.DataFrame({
    'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A' : [1,  np.nan,4, 7],
    'B' : [2, np.nan, 5, 8],
    'C' : [3, 6, 9, np.nan]
}).set_index('date')

print(df)
"""
              A    B    C
date                     
05/9/2023   1.0  2.0  3.0
07/10/2023  NaN  NaN  6.0
08/11/2023  4.0  5.0  9.0
06/12/2023  7.0  8.0  NaN
"""
start_time = datetime.now()
df['Without_NAN'] = df[df.columns].agg(lambda x: ','.join(x.dropna().astype(str)),axis=1)
print(df)

"""

              A    B    C  Without_NAN
date                                  
05/9/2023   1.0  2.0  3.0  1.0,2.0,3.0
07/10/2023  NaN  NaN  6.0          6.0
08/11/2023  4.0  5.0  9.0  4.0,5.0,9.0
06/12/2023  7.0  8.0  NaN      7.0,8.0

"""
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Concatenate pandas string columns with separator for large dataframe

1 Answers1