
So I currently have a dataframe that looks like the following:

+-------------+----------------+---------------+------------------+-----------------+
| customer_id | init_base_date | init_end_date | reinit_base_date | reinit_end_date |
+-------------+----------------+---------------+------------------+-----------------+
| ...         |                |               |                  |                 |
| A           | 2015-07-30     |               |                  |                 |
| A           |                | 2016-07-24    |                  |                 |
| B           | 2015-07-10     |               |                  |                 |
| B           |                | 2015-10-05    |                  |                 |
| B           |                |               | 2016-01-09       |                 |
| B           |                |               |                  | 2016-07-04      |
| C           | 2015-05-13     |               |                  |                 |
| C           |                | 2015-08-09    |                  |                 |
| ...         |                |               |                  |                 |
+-------------+----------------+---------------+------------------+-----------------+

and I really need to convert it to the form:

+-------------+----------------+---------------+------------------+-----------------+
| customer_id | init_base_date | init_end_date | reinit_base_date | reinit_end_date |
+-------------+----------------+---------------+------------------+-----------------+
| ...         |                |               |                  |                 |
| A           | 2015-07-30     | 2016-07-24    |                  |                 |
| B           | 2015-07-10     | 2015-10-05    | 2016-01-09       | 2016-07-04      |
| C           | 2015-05-13     | 2015-08-09    |                  |                 |
| ...         |                |               |                  |                 |
+-------------+----------------+---------------+------------------+-----------------+

I can think of a couple of really tedious ways to do the above, but I was wondering if there is a quick and efficient method (maybe using window functions?). I've only been using PySpark for a month now, so I'm definitely still a novice.

  • Please read [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/8371915) and [edit] your question accordingly. – Alper t. Turker Jul 11 '18 at 12:09
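Following up on that comment, here is a hypothetical reproducible setup matching the table above, assuming the blank cells are real nulls (`None`). The `spark.createDataFrame` call is commented out since it requires a live `SparkSession`:

```python
# Column names taken from the table in the question.
columns = ['customer_id', 'init_base_date', 'init_end_date',
           'reinit_base_date', 'reinit_end_date']

# One tuple per row; blanks in the table become None (null in Spark).
data = [
    ('A', '2015-07-30', None, None, None),
    ('A', None, '2016-07-24', None, None),
    ('B', '2015-07-10', None, None, None),
    ('B', None, '2015-10-05', None, None),
    ('B', None, None, '2016-01-09', None),
    ('B', None, None, None, '2016-07-04'),
    ('C', '2015-05-13', None, None, None),
    ('C', None, '2015-08-09', None, None),
]

# df = spark.createDataFrame(data, columns)  # needs an active SparkSession
```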

1 Answer


If those empty cells you showed are actually nulls (as opposed to empty strings), you can use pyspark.sql.functions.first() as an aggregate function in a groupBy. The key is to set the ignorenulls argument of first() to True (it defaults to False).

import pyspark.sql.functions as f

# aggregate every column except the grouping key, keeping the first non-null value
cols = [c for c in df.columns if c != 'customer_id']
df.groupBy('customer_id').agg(*[f.first(c, ignorenulls=True).alias(c) for c in cols]).show()
#+-----------+--------------+-------------+----------------+---------------+
#|customer_id|init_base_date|init_end_date|reinit_base_date|reinit_end_date|
#+-----------+--------------+-------------+----------------+---------------+
#|          A|    2015-07-30|   2016-07-24|            null|           null|
#|          B|    2015-07-10|   2015-10-05|      2016-01-09|     2016-07-04|
#|          C|    2015-05-13|   2015-08-09|            null|           null|
#+-----------+--------------+-------------+----------------+---------------+
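Conceptually, `first()` with `ignorenulls=True` just picks the first non-null value in each group, column by column. A plain-Python sketch of that semantics (illustrative only, using customer B's rows from the question):

```python
def first_non_null(values):
    # Return the first value that is not None, or None if all are None --
    # the same rule f.first(c, ignorenulls=True) applies within each group.
    return next((v for v in values if v is not None), None)

# Customer B's four rows, one list per column.
group_b = {
    'init_base_date':   ['2015-07-10', None, None, None],
    'init_end_date':    [None, '2015-10-05', None, None],
    'reinit_base_date': [None, None, '2016-01-09', None],
    'reinit_end_date':  [None, None, None, '2016-07-04'],
}

collapsed = {c: first_non_null(vs) for c, vs in group_b.items()}
print(collapsed)
# {'init_base_date': '2015-07-10', 'init_end_date': '2015-10-05',
#  'reinit_base_date': '2016-01-09', 'reinit_end_date': '2016-07-04'}
```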

If instead those blank values are actually empty strings, you can first replace the empty strings with null, apply the method above, and then (optionally) fill the nulls back in with blanks.

from functools import reduce  # reduce lives in functools on Python 3

cols = [c for c in df.columns if c != 'customer_id']
# turn empty strings into nulls so first() can skip them
# (f.when() with no otherwise() yields null when the condition is false)
df = reduce(lambda df, c: df.withColumn(c, f.when(f.col(c) != '', f.col(c))), cols, df)
df = df.groupBy('customer_id').agg(*[f.first(c, ignorenulls=True).alias(c) for c in cols])
df.na.fill('').show()  # optionally turn the nulls back into blanks
#+-----------+--------------+-------------+----------------+---------------+
#|customer_id|init_base_date|init_end_date|reinit_base_date|reinit_end_date|
#+-----------+--------------+-------------+----------------+---------------+
#|          A|    2015-07-30|   2016-07-24|                |               |
#|          B|    2015-07-10|   2015-10-05|      2016-01-09|     2016-07-04|
#|          C|    2015-05-13|   2015-08-09|                |               |
#+-----------+--------------+-------------+----------------+---------------+
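As a sanity check on the approach (not part of the original answer), pandas has the same ignore-nulls behavior built into `GroupBy.first()`, so you can verify the expected collapse on a small sample without a Spark cluster:

```python
import pandas as pd

# Small sample mirroring customers A and B from the question;
# None stands in for the null cells.
pdf = pd.DataFrame({
    'customer_id':      ['A', 'A', 'B', 'B', 'B', 'B'],
    'init_base_date':   ['2015-07-30', None, '2015-07-10', None, None, None],
    'init_end_date':    [None, '2016-07-24', None, '2015-10-05', None, None],
    'reinit_base_date': [None, None, None, None, '2016-01-09', None],
    'reinit_end_date':  [None, None, None, None, None, '2016-07-04'],
})

# GroupBy.first() returns the first non-null value per column per group,
# mirroring f.first(c, ignorenulls=True) in Spark.
collapsed = pdf.groupby('customer_id', as_index=False).first()
print(collapsed)
```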
pault