So I basically have 2 csv files:
1) Stamps.csv (800 rows)
[id (int64), start date (datetime64[ns]), end
date (datetime64[ns])]
2) Data.csv (360 000 rows)[date time(datetime64[ns])]
All entries in Stamps.csv have unique start date and end date intervals. I loaded them in 2 dataframes
df=pd.read_csv(Stamps.csv)
df2=pd.read_csv(Data.csv)
and trying to match the entries from df2 to entries from df, based on their date time fields; such that if the "date time" value from df2 is between an interval, then give it the id from df. Right now im trying to do that like this:
for index2, row2 in df2.iterrows(): #for every entry in the Data.csv
for index, row in df.iterrows(): #for every entry in Stamp.csv
if(row['StartDateTime']<=row2['Datetime']<=row['StopDateTime']): #if the date from Data.csv is between 2 intervals from Stamp.csv
df2['JobRef']=row['JobRef'] #give the id from Stamp.csv to the entry from Data.csv
break
df.to_csv(path_or_buf=path, sep=';', index=False)
I got the wanted result, but it takes a really long time to process (first time it took 3 hours to run). Is there any way of doing this faster?
P.S. I also tried multithreading the for loops, but it still took a long time executing.