
I am using the function segmentMatch, to which I pass two dataframes. I loop through one dataframe with a for loop, check some conditions, and merge the loop variable with the other dataframe inside the loop. It gives me the correct result, but because both dataframes are very big, it is too slow.

Is there any way I can improve the speed?

def segmentMatch(self, df, df_program):

    df_result = []
    for i, rview in df.iterrows():
        df_tmp = []
        df1 = []
        df_tmp = df_program.ix[(df_program.iD == rview['id']) & 
                                (rview['end_time'] >= df_program.START_TIME) &
                                (rview['start_time'] <= df_program.END_TIME)]
        df1 = rview.to_frame().transpose()
        tmp = pd.merge(df1, df_tmp,how='left')
        df_result.append(tmp)


    result = pd.concat(df_result, axis=0)
    del(df1, df_tmp, tmp)
    return result

Please help me. I am using Visual Studio Code and Python 3.6.

Thanks in advance.


1 Answer


In general the advice is to never loop through a dataframe if it can be avoided. Looping is very slow compared to any merge or join.
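
For a feel of the difference, here is a small self-contained comparison (toy data, with sizes and values invented purely for illustration and column names copied from the question) between the row-by-row approach and a single merge followed by a vectorised filter:

import time
import numpy as np
import pandas as pd

# Toy stand-ins for df and df_program
n = 2000
df = pd.DataFrame({'id': np.random.randint(0, 100, n),
                   'start_time': np.random.randint(0, 1000, n)})
df['end_time'] = df['start_time'] + 10
df_program = pd.DataFrame({'iD': np.random.randint(0, 100, n),
                           'START_TIME': np.random.randint(0, 1000, n)})
df_program['END_TIME'] = df_program['START_TIME'] + 10

# Row by row, as in the question
t0 = time.perf_counter()
parts = []
for _, rview in df.iterrows():
    df_tmp = df_program[(df_program.iD == rview['id']) &
                        (rview['end_time'] >= df_program.START_TIME) &
                        (rview['start_time'] <= df_program.END_TIME)]
    parts.append(rview.to_frame().T.merge(df_tmp, left_on='id', right_on='iD', how='left'))
looped = pd.concat(parts)
t1 = time.perf_counter()

# One merge on the id column, then one vectorised filter
merged = df.merge(df_program, left_on='id', right_on='iD', how='left')
vectorised = merged[(merged.end_time >= merged.START_TIME) &
                    (merged.start_time <= merged.END_TIME)]
t2 = time.perf_counter()

print(f'loop: {t1 - t0:.2f}s  merge+filter: {t2 - t1:.2f}s')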

Conditional joins are not great in pandas. They are pretty easy in SQL, however. A small lifehack could be to pip install pandasql and actually write the join in SQL. The example below is not tested.

import pandasql as ps

sqlcode = '''
SELECT *
FROM df
JOIN df_program ON 1=1
    AND df_program.iD = df.id 
    AND df.end_time >= df_program.START_TIME
    AND df.start_time <= df_program.END_TIME
'''

new_df = ps.sqldf(sqlcode, locals())

If you prefer not to use pandasql, I would suggest just merging and checking the conditions afterwards. That of course requires a bit more memory, depending on the overlap in IDs. Again, it is a bit tricky without the data, but something along the lines of:

full_df = df.merge(df_program, left_on='id', right_on='iD', how='left')
filtered_df = full_df.loc[(full_df.end_time >= full_df.START_TIME) &
                          (full_df.start_time <= full_df.END_TIME)]
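
The filter above keeps only the rows where a programme window actually matches. If (as raised in the comments) you want to keep every row of df and simply leave the programme columns empty where nothing matches, an untested sketch along the same lines, using a temporary index column from reset_index() to re-attach the unmatched rows, could be:

import pandas as pd

df_idx = df.reset_index()                      # keep the original row number of df around
full_df = df_idx.merge(df_program, left_on='id', right_on='iD', how='left')
in_window = (full_df.end_time >= full_df.START_TIME) & (full_df.start_time <= full_df.END_TIME)
matched = full_df.loc[in_window]

# df rows with no programme in their window come back with empty programme columns
unmatched = df_idx.loc[~df_idx['index'].isin(matched['index'])]
result = (pd.concat([matched, unmatched], sort=False)
            .sort_values('index')
            .drop(columns='index'))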

If it doesn't fit in memory, you could try to do the same with a dask dataframe.

import dask.dataframe as dd

# Read your CSVs in like this
df = dd.read_csv('')
df_program = dd.read_csv('')

# Now make sure your ids are the index in both dataframes
df = df.set_index('id')
df_program = df_program.set_index('iD')

# Join and filter like above
full_df = df.join(df_program, how='left')
filtered_df = full_df[(full_df.end_time >= full_df.START_TIME) &
                      (full_df.start_time <= full_df.END_TIME)]

# Write the result to a CSV or convert to pandas (if it fits in your memory anyway):
result = filtered_df.compute()
  • Thanks for your reply. I will try with SQL. Is there any other way to improve my code without using SQL? – pri Feb 27 '19 at 13:00
  • Added a suggestion - think you can just join on the equality and filter as a second step :) – Roelant Feb 27 '19 at 13:11
  • Thanks again. I have tried using join(), but because it filters after the join, it removes all the data where the condition doesn't match. I want all the data from df plus the matching data from df_program. Could you please help again? – pri Feb 27 '19 at 16:45
  • Also, when I try to join all the records it gives me a memory error because the dataframes are too big to join – pri Feb 27 '19 at 16:52
  • Added another option :) – Roelant Feb 28 '19 at 12:52