
Let's say I have a dataframe with the following columns:

# id  | name  | 01-Jan-10 | 01-Feb-10 | ... | 01-Jan-11 | 01-Feb-11
# -----------------------------------------------------------------
# 1   | a001  |     0     |    32     | ... |     14    |    108
# 1   | a002  |    80     |     0     | ... |      0    |     92

I want to expand this into a table like this:

# id  | name  | Jan | Feb | ... | Year
# -----------------------------------
# 1   | a001  |   0 |  32 | ... | 2010
# 1   | a001  |  14 | 108 | ... | 2011
# 1   | a002  |  80 |   0 | ... | 2010
# 1   | a002  |   0 |  92 | ... | 2011

I'd like to split the dates into rows by year and capture the values per month.

How might this be accomplished in PySpark (Python + Spark)? I've been attempting to collect the df data and iterate over it, extracting each field to write out each row, but I wondered whether there is a more clever Spark function that would help with this. (I'm new to Spark.)

Tibberzz

1 Answer


First melt the DataFrame (see How to melt Spark DataFrame?):
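Note that melt is not a built-in DataFrame method in the Spark versions this answer targets (a built-in unpivot/melt only appears in much later releases), so a helper has to be defined first. Below is a minimal sketch along the lines of the linked question; the function name and signature are assumptions for illustration, not part of the original answer:

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One (variable, value) struct per column that should become a row
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # Explode the array into rows, then flatten the struct back into columns
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    return tmp.select(
        *id_vars,
        col("_vars_and_vals")[var_name].alias(var_name),
        col("_vars_and_vals")[value_name].alias(value_name),
    )

With the helper in place, build the example data and melt it: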

df = spark.createDataFrame(
    [(1, "a001", 0, 32, 14, 108), (2, "a02", 80, 0, 0, 92)],
    ("id", "name", "01-Jan-10", "01-Feb-10", "01-Jan-11", "01-Feb-11")
)

df_long = melt(df, df.columns[:2], df.columns[2:])

# +---+----+---------+-----+
# | id|name| variable|value|
# +---+----+---------+-----+
# |  1|a001|01-Jan-10|    0|
# |  1|a001|01-Feb-10|   32|
# |  1|a001|01-Jan-11|   14|
# |  1|a001|01-Feb-11|  108|
# |  2| a02|01-Jan-10|   80|
# |  2| a02|01-Feb-10|    0|
# |  2| a02|01-Jan-11|    0|
# |  2| a02|01-Feb-11|   92|
# +---+----+---------+-----+

Next, parse the date and extract the year and month:

from pyspark.sql.functions import to_date, date_format, year

date = to_date("variable", "dd-MMM-yy")

parsed = df_long.select(
    "id", "name", "value", 
    year(date).alias("year"), date_format(date, "MMM").alias("month")
)

# +---+----+-----+----+-----+
# | id|name|value|year|month|
# +---+----+-----+----+-----+
# |  1|a001|    0|2010|  Jan|
# |  1|a001|   32|2010|  Feb|
# |  1|a001|   14|2011|  Jan|
# |  1|a001|  108|2011|  Feb|
# |  2| a02|   80|2010|  Jan|
# |  2| a02|    0|2010|  Feb|
# |  2| a02|    0|2011|  Jan|
# |  2| a02|   92|2011|  Feb|
# +---+----+-----+----+-----+

Finally, pivot (see How to pivot Spark DataFrame?):

# Providing a list of levels is not required but will make the process faster
# months = [
#     "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
#     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
# ]

months = ["Jan", "Feb"]

parsed.groupBy("id", "name", "year").pivot("month", months).sum("value")

# +---+----+----+---+---+
# | id|name|year|Jan|Feb|
# +---+----+----+---+---+
# |  2| a02|2011|  0| 92|
# |  1|a001|2010|  0| 32|
# |  1|a001|2011| 14|108|
# |  2| a02|2010| 80|  0|
# +---+----+----+---+---+
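
The pivoted rows come back in no particular order. As the asker mentions in the comment below, appending an orderBy reproduces the layout from the question; a sketch of the full chain:

result = (
    parsed.groupBy("id", "name", "year")
    .pivot("month", months)
    .sum("value")
    .orderBy("id", "year")  # sort rows as in the desired output
)
result.show()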
Alper t. Turker
    That was perfect, thank you! I added .orderBy("id", "year") to the end and got it to look exactly how I needed it. – Tibberzz May 24 '18 at 23:18