Is there a way to merge two tables in PySpark with respect to a date, where one table contains events linked to a date and the other contains additional information attached to a period with a start and an end date?
There are similar topics for Python, but none for PySpark, such as the approach (using numpy) presented in this answer. My idea is not to retrieve only one piece of information but all the information available in my right table.
In this example, I would like to get into df1, based on the id, all the information available in df2 for this id, provided that the event_date falls between the start_period and the end_period.
import datetime

df1 = spark.createDataFrame([
(1,'a', datetime.datetime(2021,1,1)),
(1,'b', datetime.datetime(2021,1,5)),
(1,'c', datetime.datetime(2021,1,24)),
(2,'d', datetime.datetime(2021,1,10)),
(2,'e' , datetime.datetime(2021,1,15))], ['id','event','event_date'])
df2 = spark.createDataFrame([
(1,'Xxz45','XX013', datetime.datetime(2021,1,1), datetime.datetime(2021,1,10)),
(1,'Xasz','XX014', datetime.datetime(2021,1,11), datetime.datetime(2021,1,22)),
(1,'Xbbd','XX015', datetime.datetime(2021,1,23), datetime.datetime(2021,1,26)),
(1,'Xaaq','XX016', datetime.datetime(2021,1,27), datetime.datetime(2021,1,31))], ['id','info1','info2','start_period', 'end_period'])
[EDIT] The expected output would be (merging on id and on event_date included in the period):
df_results = spark.createDataFrame([
(1, 'a', datetime.datetime(2021,1,1),'Xxz45','XX013'),
(1, 'b', datetime.datetime(2021,1,5),'Xxz45','XX013'),
(1, 'c', datetime.datetime(2021,1,24),'Xbbd','XX015'),
(2,'d', datetime.datetime(2021,1,10), None, None),
(2,'e', datetime.datetime(2021,1,15), None, None)], ['id','event','event_date','info1','info2'])