
So I basically have two CSV files:

1) Stamps.csv (800 rows): [id (int64), start date (datetime64[ns]), end date (datetime64[ns])]

2) Data.csv (360 000 rows): [date time (datetime64[ns])]

All entries in Stamps.csv have unique start date / end date intervals. I loaded the files into two dataframes

df = pd.read_csv("Stamps.csv")
df2 = pd.read_csv("Data.csv")

and I am trying to match the entries from df2 to entries from df based on their datetime fields: if the "date time" value from df2 falls inside an interval from df, that row should get the corresponding id from df. Right now I'm doing it like this:

for index2, row2 in df2.iterrows():  # for every entry in Data.csv
    for index, row in df.iterrows():  # for every entry in Stamps.csv
        # if the date from Data.csv falls inside an interval from Stamps.csv
        if row['StartDateTime'] <= row2['Datetime'] <= row['StopDateTime']:
            # give the id from Stamps.csv to that row of Data.csv
            df2.at[index2, 'JobRef'] = row['JobRef']
            break
df2.to_csv(path_or_buf=path, sep=';', index=False)

I got the wanted result, but it takes a really long time to process (the first run took 3 hours). Is there any way to do this faster?

P.S. I also tried multithreading the for loops, but it still took a long time to execute.

  • Don't iterate over the large data frame; use vectorized comparisons (see the sketch after these comments). This should speed things up significantly. – Quickbeam2k1 Apr 04 '18 at 18:47
  • I think you can use https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range – BENY Apr 04 '18 at 18:48
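For reference, a minimal sketch of the vectorized approach these comments point to, assuming the column names used in the loop above (StartDateTime, StopDateTime, Datetime, JobRef) and that the intervals in Stamps.csv never overlap:

import pandas as pd

df = pd.read_csv("Stamps.csv", parse_dates=["StartDateTime", "StopDateTime"])
df2 = pd.read_csv("Data.csv", parse_dates=["Datetime"])

# Build an IntervalIndex from the start/stop columns; closed="both" matches
# the <= comparisons in the original loop.
intervals = pd.IntervalIndex.from_arrays(
    df["StartDateTime"], df["StopDateTime"], closed="both"
)

# For every timestamp in df2, look up the position of the interval that
# contains it (-1 means no interval matched).
positions = intervals.get_indexer(df2["Datetime"])

# Copy the matching id over; rows with no matching interval are left untouched.
matched = positions != -1
df2.loc[matched, "JobRef"] = df["JobRef"].to_numpy()[positions[matched]]

This replaces the 800 × 360 000 Python-level comparisons with one indexed lookup per timestamp.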

1 Answer


I ended up doing this, thanks to chris dorn for posting the answer:

import pandasql as ps

sqlcode = '''
select df2.*, df.JobRef
from df
inner join df2 
on df2.Datetime between df.StartDateTime and df.StopDateTime
'''
df2 = ps.sqldf(sqlcode, locals())

It works pretty fast.
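Note that pandasql runs the query against an in-memory SQLite database behind the scenes, and the inner join drops any row of df2 whose Datetime falls outside every interval; if those rows should be kept with an empty JobRef instead, switching the query to "from df2 left join df on ..." would do it.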
