0

I have two columns A(year1) and B(year2) in spark. I need to create a column C which has to contain an array of years between year 1 and year 2 .
suppose A - 1990 & B - 1993 o/p C - should be [1990,1990,1991,1991,1992,1992,1993,1993]

could anyone come up with a solution (spark) with out using udf

  • 1
    Why does your column C contain every year two times? Please also keep in mind that stackoverflow is not a code writting service and you should therefore always add what you have tried by yourself. – cronoik Jul 01 '19 at 10:40
  • df_policy = df_policy.withColumn("range", col(F.array(range(df_policy.startyr,df_policy.stopyr)))) – Jason Roy Jul 01 '19 at 10:44
  • Possible duplicate of [Filling gaps in timeseries Spark](https://stackoverflow.com/questions/42411184/filling-gaps-in-timeseries-spark) – pault Jul 01 '19 at 13:58

1 Answers1

0

You could try, assume df contains year1 and year2.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

years = spark.range(2020).withColumnRenamed('id', 'year')
df = (
    df
    .withColumn(
        ‘id’,
        F. monotonically_increasing_id()
        ) # EDIT: There was a missing bracket here
    .join(
        years,
        F.col(‘year’).between(‘year1’, ‘year2’),
    )
    .groupBy(
        ‘id’
    )
    .agg(
        F.collect_list(‘year’).alias(‘years’)
    )
)

Let me know it this doesn't work.

tpain
  • 144
  • 6
  • years = spark.range(2020).withColumnRenamed('id', 'year') df_policy = (df_policy.withColumn('id',F. monotonically_increasing_id().join(years,F.col('year').between('startyr', 'stopyr'),).groupBy('id').agg(F.collect_list('year').alias('years') ) )) I tried this one but I am getting this error TypeError: 'Column' object is not callable – Jason Roy Jul 03 '19 at 14:38
  • Ahh sorry, I’ve edited the answer there was a missing parenthesis after F.monotonically_increasing_id() – tpain Jul 03 '19 at 15:23
  • df_policy =(df_policy.withColumn('id',F.monotonically_increasing_id()).join(years,F.col('year').between(df_policy.startyr, df_policy.stopyr),).groupBy('id').agg(F.collect_list('year').alias('years'))) I tried this one but I m getting this error AnalysisException: u'Cannot resolve column name "stopyr" among (id, years) – Jason Roy Jul 08 '19 at 15:56