
I'm learning PySpark, and I used it to read a CSV file into a dataframe (with column names):

>>> example_df.show(n=5)
+---------------------+------+-------------+---------+---------+-------------+------------+
|                   id| price|       street| locality|town_city|     district|     country|
+---------------------+------+-------------+---------+---------+-------------+------------+
|                bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|                aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|                bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|                aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|                cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+---------------------+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows

For rows with the same location (i.e. the same values for street, locality, town_city, district and country), I want to create a new location_id and group the corresponding rows together, something like this:

+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
|location_id|                   id| price|       street| locality|town_city|     district|     country|
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
|          0|                aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|          0|                aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|          1|                bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|          1|                bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|          2|                cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
...

I know how to create the new column:

from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("location_id", monotonically_increasing_id())

This just creates a new column with 0, 1, 2, etc., but how can I group the data for the same location and give each group a unique location_id? Thanks.


1 Answer


If you just need an id for each location, you can use `hash`:

from pyspark.sql import functions as F

# hash combines the five location columns into a single integer id
example_df.withColumn('location_id',
    F.hash('street', 'locality', 'town_city', 'district', 'country')) \
    .show()

Output:

+-----+------+-------------+---------+---------+-------------+------------+-----------+
|   id| price|       street| locality|town_city|     district|     country|location_id|
+-----+------+-------------+---------+---------+-------------+------------+-----------+
|bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|  406601501|
|aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|  -97363104|
|bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|  406601501|
|aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|  -97363104|
|cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| 1396052469|
+-----+------+-------------+---------+---------+-------------+------------+-----------+

Edit: please remember that the `hash` function might cause hash collisions.
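
If collision-free, consecutive ids like the 0, 1, 2 in the question are needed, one possible alternative is `dense_rank` over a window ordered by the location columns. This is only a sketch: a window with `orderBy` but no `partitionBy` pulls all rows into a single partition, so it only suits datasets that fit comfortably there.

from pyspark.sql import functions as F
from pyspark.sql import Window

# rows with identical location columns get the same rank;
# dense_rank starts at 1, so subtract 1 to get 0-based ids
w = Window.orderBy('street', 'locality', 'town_city', 'district', 'country')
example_df.withColumn('location_id', F.dense_rank().over(w) - 1).show()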

  • Thanks, can I ask why some of the hash values are negative? And is there a way to move `location_id` to be the first column? – wawawa Aug 17 '21 at 21:35
  • 1
    @Cecilia the `hash` function might return negative and positive values. It is not an id generator (and might cause [hash collisions](https://en.wikipedia.org/wiki/Hash_collision) although it is not very probable if the dataset is not too large). To generate only positive values, please check out [this answer](https://stackoverflow.com/a/66052401/2129801). To reorder the columns, you can use `.select('location_id', 'id', 'price', ...)` – werner Aug 18 '21 at 16:34
  • Thanks, the answer you mentioned to convert the hashes to positive values is in Scala, just wondering how I can achieve it in Python? – wawawa Aug 22 '21 at 11:38
  • It seems like we can also use `F.sha2`: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.sha2.html but it seems like it doesn't take multiple columns. I tried `example_df.withColumn('location_id', F.sha2('street','locality','town_city','district','country'))` but it gave me the error `TypeError: sha2() takes 2 positional arguments but 5 were given` – wawawa Aug 22 '21 at 11:53
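
A possible fix for that `sha2` error (a sketch): `sha2` expects a single string column plus a bit length (224, 256, 384 or 512), so the location columns have to be concatenated first with `concat_ws`; the `'||'` separator here is an arbitrary choice to keep values from different columns from running together.

from pyspark.sql import functions as F

# concat_ws joins the location columns into one string,
# which sha2 then hashes to a 256-bit hex digest
example_df.withColumn(
    'location_id',
    F.sha2(F.concat_ws('||', 'street', 'locality', 'town_city',
                       'district', 'country'), 256)
).show(truncate=False)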