
I am using the function segmentMatch, to which I pass two dataframes. I loop through one dataframe with a for loop, check some conditions, and merge the loop variable with the other dataframe inside the loop. It gives me the correct result, but because both dataframes are very big, it is too slow.

Is there any way I can improve the speed?

def segmentMatch(self, df, df_program):

    df_result = []
    for i, rview in df.iterrows():
        df_tmp = []
        df1 = []
        df_tmp = df_program.ix[(df_program.iD == rview['id']) & 
                                (rview['end_time'] >= df_program.START_TIME) &
                                (rview['start_time'] <= df_program.END_TIME)]
        df1 = rview.to_frame().transpose()
        tmp = pd.merge(df1, df_tmp,how='left')
        df_result.append(tmp)


    result = pd.concat(df_result, axis=0)
    del(df1, df_tmp, tmp)
    return result

Please help me. I am using Visual Studio Code and Python 3.6.

Thanks in advance.


1 Answer


In general the advice is to never loop through a dataframe if it can be avoided. Looping is very slow compared to any merge or join.
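
For a feel of the difference, here is a small self-contained comparison (toy data, with sizes and values invented purely for illustration and column names copied from the question) between the row-by-row approach and a single merge followed by a vectorised filter:

import time
import numpy as np
import pandas as pd

# Toy stand-ins for df and df_program
n = 2000
df = pd.DataFrame({'id': np.random.randint(0, 100, n),
                   'start_time': np.random.randint(0, 1000, n)})
df['end_time'] = df['start_time'] + 10
df_program = pd.DataFrame({'iD': np.random.randint(0, 100, n),
                           'START_TIME': np.random.randint(0, 1000, n)})
df_program['END_TIME'] = df_program['START_TIME'] + 10

# Row by row, as in the question
t0 = time.perf_counter()
parts = []
for _, rview in df.iterrows():
    df_tmp = df_program[(df_program.iD == rview['id']) &
                        (rview['end_time'] >= df_program.START_TIME) &
                        (rview['start_time'] <= df_program.END_TIME)]
    parts.append(rview.to_frame().T.merge(df_tmp, left_on='id', right_on='iD', how='left'))
looped = pd.concat(parts)
t1 = time.perf_counter()

# One merge on the id column, then one vectorised filter
merged = df.merge(df_program, left_on='id', right_on='iD', how='left')
vectorised = merged[(merged.end_time >= merged.START_TIME) &
                    (merged.start_time <= merged.END_TIME)]
t2 = time.perf_counter()

print(f'loop: {t1 - t0:.2f}s  merge+filter: {t2 - t1:.2f}s')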

Conditional joins are not great in pandas. They are pretty easy in SQL, however. A small lifehack could be to pip install pandasql and actually write the join in SQL. The example below is not tested.

import pandasql as ps

sqlcode = '''
SELECT *
FROM df
JOIN df_program ON 1=1
    AND df_program.iD = df.id 
    AND df.end_time >= df_program.START_TIME
    AND df.start_time <= df_program.END_TIME
'''

new_df = ps.sqldf(sqlcode, locals())

If you prefer not to use pandasql, I would suggest just merging and checking the conditions afterwards. That of course requires a bit more memory, depending on the overlap in IDs. Again, it is a bit tricky without the data, but something along the lines of:

full_df = df.merge(df_program, left_on='id', right_on='iD', how='left')
filtered_df = full_df.loc[(full_df.end_time >= full_df.START_TIME) &
                          (full_df.start_time <= full_df.END_TIME)]
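
The filter above keeps only the rows where a programme window actually matches. If (as raised in the comments) you want to keep every row of df and simply leave the programme columns empty where nothing matches, an untested sketch along the same lines, using a temporary index column from reset_index() to re-attach the unmatched rows, could be:

import pandas as pd

df_idx = df.reset_index()                      # keep the original row number of df around
full_df = df_idx.merge(df_program, left_on='id', right_on='iD', how='left')
in_window = (full_df.end_time >= full_df.START_TIME) & (full_df.start_time <= full_df.END_TIME)
matched = full_df.loc[in_window]

# df rows with no programme in their window come back with empty programme columns
unmatched = df_idx.loc[~df_idx['index'].isin(matched['index'])]
result = (pd.concat([matched, unmatched], sort=False)
            .sort_values('index')
            .drop(columns='index'))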

If it doesn't fit in memory, you could try to do the same with a dask dataframe.

import dask.dataframe as dd

# Read your CSVs in like this
df = dd.read_csv('')
df_program = dd.read_csv('')

# Now make sure your ids are the index in both dataframes
df = df.set_index('id')
df_program = df_program.set_index('iD')

# Join and filter like above
full_df = df.join(df_program, how='left')
filtered_df = full_df[(full_df.end_time >= full_df.START_TIME) &
                      (full_df.start_time <= full_df.END_TIME)]

# Write the result to a CSV or convert to pandas (if it fits in your memory anyway):
result = filtered_df.compute()
  • Thanks for your reply. I will try with SQL. Is there any other way to improve my code without using SQL? – pri Feb 27 '19 at 13:00
  • Added a suggestion - think you can just join on the equality and filter as a second step :) – Roelant Feb 27 '19 at 13:11
  • Thanks again. I have tried using join(), but because it filters after the join, it removes all the data where the condition doesn't match. I want all the data from df plus the matching data from df_program. Could you please help again? – pri Feb 27 '19 at 16:45
  • Also, when I try to join all the records it gives me a memory error because the dataframes are too big to join – pri Feb 27 '19 at 16:52
  • Added another option :) – Roelant Feb 28 '19 at 12:52