1

I have a large dataframe (~1 million rows) with 20 string columns that I'm trying to concatenate into a single column with a separator, dropping NA values on the way. (Each row has a variable number of valid entries and NA values.)

Based on the solution here, I can get the output I need using df.apply but it is very slow:

raw['combined'] = raw.loc[:, 'record_1':'record_20'].apply(lambda x: '|'.join(x.dropna().values), axis=1)

Is there a faster way to do this concatenation or am I stuck with df.apply?

TY Lim
  • 509
  • 1
  • 3
  • 11
  • It is very difficult to answer your question without seeing both the data and the code produces your problem. Please read about how to ask a good question and try to post a [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example "Minimal Reproducible Example") so we can better help you. – itprorh66 Jun 18 '22 at 14:43
  • Check out: [Concatenate column values in a pandas DataFrame while ignoring NaNs](https://stackoverflow.com/questions/54468506/concatenate-column-values-in-a-pandas-dataframe-while-ignoring-nans) in particular solution by rafaelc. – DarrylG Jun 18 '22 at 14:50
  • What format are we talking about? CSV? Can the delimiter be part of the content? Is there anything else in the file except these 20 columns? I think if it is CSV, then the best way would be to not load it as DataFrame but work in the file text per line directly. Something like `with open('filename') as f: lines = [line.rstrip() for line in f]`, merge that in the required way by replacing the delimiter with `|` and write to an output file directly, so you don't keep anything in memory. – ewz93 Jun 18 '22 at 15:20

1 Answers1

0

agg is much faster than apply. Use agg instead.

import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime


df = pd.DataFrame({
    'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A' : [1,  np.nan,4, 7],
    'B' : [2, np.nan, 5, 8],
    'C' : [3, 6, 9, np.nan]
}).set_index('date')

print(df)
"""
              A    B    C
date                     
05/9/2023   1.0  2.0  3.0
07/10/2023  NaN  NaN  6.0
08/11/2023  4.0  5.0  9.0
06/12/2023  7.0  8.0  NaN
"""
start_time = datetime.now()
df['Without_NAN'] = df[df.columns].agg(lambda x: ','.join(x.dropna().astype(str)),axis=1)
print(df)

"""

              A    B    C  Without_NAN
date                                  
05/9/2023   1.0  2.0  3.0  1.0,2.0,3.0
07/10/2023  NaN  NaN  6.0          6.0
08/11/2023  4.0  5.0  9.0  4.0,5.0,9.0
06/12/2023  7.0  8.0  NaN      7.0,8.0

"""
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
Soudipta Dutta
  • 1,353
  • 1
  • 12
  • 7