I have an ETL script using Pandas, and to make it more scalable I am trying to recreate it with PySpark. I have most of it working so far, but I'm having issues with a particular transformation to a daily dataset. I have one record per ID with a start date and an end date:
id age state start_date end_date
123 18 CA 2/17/2019 5/4/2019
223 24 AZ 1/17/2019 3/4/2019
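For reference, here's a minimal snippet that builds this sample as a Spark DataFrame (the dates are plain strings here, which is an assumption; in the real data they may already be date-typed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample input matching the table above, dates kept as strings
df = spark.createDataFrame(
    [(123, 18, 'CA', '2/17/2019', '5/4/2019'),
     (223, 24, 'AZ', '1/17/2019', '3/4/2019')],
    ['id', 'age', 'state', 'start_date', 'end_date'],
)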
I want to create a record for each day from the start date through the end date (inclusive), so I can join daily activity data to it. The target output would look something like this:
id age state start_date
123 18 CA 2/17/2019
123 18 CA 2/18/2019
123 18 CA 2/19/2019
123 18 CA 2/20/2019
123 18 CA 2/21/2019
…
123 18 CA 5/2/2019
123 18 CA 5/3/2019
123 18 CA 5/4/2019
And of course do this for all IDs and their respective date ranges in the dataset. I was able to do this in Pandas with the following approach:
import pandas as pd

# Melt start_date/end_date into a single 'date' column, one row per endpoint
melt = df.melt(id_vars=['id', 'age', 'state'], value_name='date').drop('variable', axis=1)
melt['date'] = pd.to_datetime(melt['date'])

# Resample each id to daily frequency between its endpoints, forward-filling the attributes
melt = melt.groupby('id').apply(lambda x: x.set_index('date').resample('d').first())\
           .ffill()\
           .reset_index(level=1)\
           .reset_index(drop=True)
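I've seen that Spark has sequence and explode functions that look relevant, so the rough (untested) direction I'm imagining is the sketch below; the 'M/d/yyyy' format string is a guess based on the sample data, and sequence needs Spark 2.4+.

from pyspark.sql import functions as F

# Parse the string dates; the 'M/d/yyyy' pattern is a guess from the sample rows
parsed = (df
          .withColumn('start_date', F.to_date('start_date', 'M/d/yyyy'))
          .withColumn('end_date', F.to_date('end_date', 'M/d/yyyy')))

# sequence() builds an inclusive array of daily dates per row (Spark 2.4+),
# and explode() turns that array into one output row per date
daily = (parsed
         .withColumn('date', F.explode(F.expr('sequence(start_date, end_date, interval 1 day)')))
         .drop('start_date', 'end_date'))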
But I am fairly new to PySpark (and was struggling with this in Pandas too), so I'm stuck here. Any help is much appreciated - thanks!