
I have a dataframe in PySpark as below:

ID  Name  add     date from   date end
1   aaa   yyyyyy  20-01-2018  30-01-2018
2   bbb   ffffff  02-11-2018  15-11-2018

but am looking to get output as below:

ID  Name  add     date from   date end
1   aaa   yyyyyy  20-01-2018  30-01-2018
1   aaa   yyyyyy  21-01-2018  30-01-2018
1   aaa   yyyyyy  22-01-2018  30-01-2018
1   aaa   yyyyyy  23-01-2018  30-01-2018
1   aaa   yyyyyy  24-01-2018  30-01-2018
1   aaa   yyyyyy  25-01-2018  30-01-2018
1   aaa   yyyyyy  26-01-2018  30-01-2018
1   aaa   yyyyyy  27-01-2018  30-01-2018
1   aaa   yyyyyy  28-01-2018  30-01-2018
1   aaa   yyyyyy  29-01-2018  30-01-2018
1   aaa   yyyyyy  30-01-2018  30-01-2018
2   bbb   ffffff  02-11-2018  15-11-2018
2   bbb   ffffff  03-11-2018  15-11-2018
2   bbb   ffffff  04-11-2018  15-11-2018
2   bbb   ffffff  05-11-2018  15-11-2018
2   bbb   ffffff  06-11-2018  15-11-2018
2   bbb   ffffff  07-11-2018  15-11-2018
2   bbb   ffffff  08-11-2018  15-11-2018
2   bbb   ffffff  09-11-2018  15-11-2018
2   bbb   ffffff  10-11-2018  15-11-2018
2   bbb   ffffff  11-11-2018  15-11-2018
2   bbb   ffffff  12-11-2018  15-11-2018
2   bbb   ffffff  13-11-2018  15-11-2018
2   bbb   ffffff  14-11-2018  15-11-2018
2   bbb   ffffff  15-11-2018  15-11-2018
Check this: https://stackoverflow.com/questions/43141671/sparksql-on-pyspark-how-to-generate-time-series – Manrique Dec 15 '18 at 01:13

2 Answers


Try this out:

a = [(1,'aaa','yyyyyy','20-01-2018','30-01-2018'),
     (2,'bbb','ffffff','02-11-2018','15-11-2018')]
df = spark.createDataFrame(a,["ID","Name","add","date_from","date_end"])

df.show()

+---+----+------+----------+----------+
| ID|Name|   add| date_from|  date_end|
+---+----+------+----------+----------+
|  1| aaa|yyyyyy|20-01-2018|30-01-2018|
|  2| bbb|ffffff|02-11-2018|15-11-2018|
+---+----+------+----------+----------+

# register the dataframe as a temp view (registerTempTable is deprecated in Spark 2.x)
df.createOrReplaceTempView("temp")

result = spark.sql("""
    select t.ID,
        t.Name,
        t.add,
        date_format(date_add(to_date(t.date_from,'dd-MM-yyyy'),pe.i),'dd-MM-yyyy') as date_from,
        t.date_end
    from temp t
    -- space(n) builds a string of n blanks; splitting it on ' ' yields n+1 elements,
    -- and posexplode turns them into rows with position i = 0..n,
    -- where n is the number of days between date_from and date_end
    lateral view posexplode(split(space(datediff(to_date(t.date_end,'dd-MM-yyyy'),to_date(t.date_from,'dd-MM-yyyy'))),' ')) pe as i,x
    """)

result.show()

+---+----+------+----------+----------+
| ID|Name|   add| date_from|  date_end|
+---+----+------+----------+----------+
|  1| aaa|yyyyyy|20-01-2018|30-01-2018|
|  1| aaa|yyyyyy|21-01-2018|30-01-2018|
|  1| aaa|yyyyyy|22-01-2018|30-01-2018|
|  1| aaa|yyyyyy|23-01-2018|30-01-2018|
|  1| aaa|yyyyyy|24-01-2018|30-01-2018|
|  1| aaa|yyyyyy|25-01-2018|30-01-2018|
|  1| aaa|yyyyyy|26-01-2018|30-01-2018|
|  1| aaa|yyyyyy|27-01-2018|30-01-2018|
|  1| aaa|yyyyyy|28-01-2018|30-01-2018|
|  1| aaa|yyyyyy|29-01-2018|30-01-2018|
|  1| aaa|yyyyyy|30-01-2018|30-01-2018|
|  2| bbb|ffffff|02-11-2018|15-11-2018|
|  2| bbb|ffffff|03-11-2018|15-11-2018|
|  2| bbb|ffffff|04-11-2018|15-11-2018|
|  2| bbb|ffffff|05-11-2018|15-11-2018|
|  2| bbb|ffffff|06-11-2018|15-11-2018|
|  2| bbb|ffffff|07-11-2018|15-11-2018|
|  2| bbb|ffffff|08-11-2018|15-11-2018|
|  2| bbb|ffffff|09-11-2018|15-11-2018|
|  2| bbb|ffffff|10-11-2018|15-11-2018|
+---+----+------+----------+----------+
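Note that show() prints only the first 20 rows by default, which is why the output for ID 2 stops at 10-11-2018; result.show(30) would display all 25 rows.

If you are on Spark 2.4 or later, the same expansion can also be sketched with the DataFrame API, using sequence and explode instead of the lateral view. This is a minimal sketch reusing the df defined above, not part of the original answer:

import pyspark.sql.functions as F

result = (df
    # sequence() builds an array holding every date from date_from to date_end
    # inclusive, stepping one day at a time; explode() emits one row per date
    .withColumn("date_from", F.explode(F.expr(
        "sequence(to_date(date_from,'dd-MM-yyyy'), "
        "to_date(date_end,'dd-MM-yyyy'), interval 1 day)")))
    # format the generated dates back to the original dd-MM-yyyy strings
    .withColumn("date_from", F.date_format("date_from", "dd-MM-yyyy")))

result.show(30)

This avoids the space/split string trick entirely, at the cost of requiring Spark 2.4+ for the sequence function.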


To perform the same in an Azure Data Factory Data Flow, follow the logic below.

  1. Add a derived column activity in the data flow; this will calculate the number of months between the start and end dates

    derived_col1=toInteger(round(monthsBetween(end_date,st_date)))
    
  2. Add another derived column activity; this will return an array containing all the months between the two dates

    derived_col2= mapLoop(derived_col1+1,toDate(addMonths(st_date, (toInteger(#index)-1))))
    
  3. Add another derived column activity; this step is similar to an explode operation and will return multiple rows

    month_column=unfold(derived_col2)
    

All of this can fit into a single derived column activity; it is written in multiple steps here only to explain the logic. A sketch of the combined expression follows.
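For reference, a sketch of that single-activity version, simply composing the three expressions above (st_date and end_date are the column names this answer assumes; note these expressions step by months, not by days as in the PySpark answer):

    month_column = unfold(mapLoop(toInteger(round(monthsBetween(end_date, st_date))) + 1,
                                  toDate(addMonths(st_date, toInteger(#index) - 1))))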
