I have edited the question based on @user17242583 comments and added more data and the output I get using just the example data.
I have two csv-files that I have read into two DataFrames, df1 and df2.
df1 (11 rows):
ID DateTime Event_no
0 1 01/01/2019 0:02 1003
1 2 01/01/2019 0:28 1015
2 3 01/01/2019 1:43 1029
3 4 01/01/2019 2:12 1042
4 5 01/01/2019 2:58 1055
5 6 01/01/2019 11:02 1068
6 7 01/01/2019 12:02 1081
7 8 01/01/2019 13:46 1094
8 9 01/01/2019 21:02 1107
9 10 01/01/2019 21:32 1120
10 11 01/01/2019 23:37 1133
df2 (14 rows):
ID lat lon DateTime
0 15 50.7823 90.564000 01/01/2019 0:03
1 16 51.4852 90.473600 01/01/2019 0:29
2 17 50.2981 90.387600 01/01/2019 3:53
3 18 50.3710 90.298667 01/01/2019 4:03
4 19 50.1289 90.210467 01/01/2019 5:03
5 20 49.8868 90.122267 01/01/2019 11:03
6 21 49.6447 90.034067 01/01/2019 13:03
7 22 49.4026 89.945867 01/01/2019 15:03
8 23 49.1605 89.857667 01/01/2019 19:03
9 24 48.9184 89.769467 01/01/2019 21:03
10 25 48.6763 89.681267 01/01/2019 22:03
11 26 48.4342 89.593067 01/01/2019 22:23
12 27 48.1921 89.504867 01/01/2019 23:03
13 28 47.9500 89.416667 01/01/2019 23:43
I need to join these two DataFrames based on the nearest date and time so that the joined Dataframe looks like this and meets these conditions:
df_join (11 rows)
- all the events need to be joined with one location
- one location can be joined with multiple events
- some locations don't have an event to join to:
ID lat lon DateTime Event_no
0 15 50.7823 90.564000 01/01/2019 0:03 1003
1 16 51.4852 90.473600 01/01/2019 0:29 1015
2 16 51.4852 90.473600 01/01/2019 0:29 1029
3 17 50.2981 90.387600 01/01/2019 3:53 1042
4 17 50.2981 90.387600 01/01/2019 3:53 1055
5 20 49.8868 90.122267 01/01/2019 11:03 1068
6 20 49.8868 90.122267 01/01/2019 11:03 1081
7 21 49.6447 90.034067 01/01/2019 13:03 1094
8 24 48.9184 89.769467 01/01/2019 21:03 1107
9 25 48.6763 89.681267 01/01/2019 22:03 1120
10 28 47.9500 89.416667 01/01/2019 23:43 1133
Following @jezrael's answer here I've written the following code:
import pandas as pd
df1 = pd.read_csv("path/filename1.csv")
df2 = pd.read_csv("path/filename2.csv")
df1['DateTime'] = pd.to_datetime(df1.DateTime)
df2['DateTime'] = pd.to_datetime(df2.DateTime)
df1.sort_values('DateTime', inplace=True)
df2.sort_values('DateTime', inplace=True)
df1_join = df1.set_index('DateTime').reindex(df2.set_index('DateTime').index, method='nearest').reset_index()
df1_merge = (pd.merge(df2, df1_join, on='DateTime'))
df1_merge.to_csv("path/filename_join.csv"")
The code runs through just fine but doesn't give me the results I need.
print(df1_join)
DateTime ID Event_no
0 2019-01-01 00:03:00 1 1003
1 2019-01-01 00:29:00 2 1015
2 2019-01-01 03:53:00 5 1055
3 2019-01-01 04:03:00 5 1055
4 2019-01-01 05:03:00 5 1055
5 2019-01-01 11:03:00 6 1068
6 2019-01-01 13:03:00 8 1094
7 2019-01-01 15:03:00 8 1094
8 2019-01-01 19:03:00 9 1107
9 2019-01-01 21:03:00 9 1107
10 2019-01-01 22:03:00 10 1120
11 2019-01-01 22:23:00 10 1120
12 2019-01-01 23:03:00 11 1133
13 2019-01-01 23:43:00 11 1133
- It doesn’t join each event to one location (events 1029, 1042 and 1081 missing)
- The code allows one event to be joined with multiple locations (1055, 1094, 1107, 1120 and 1133)
Any advice on how to edit the code so that the previously conditions are met?