I'm learning Pyspark, I used it to read a csv file into a dataframe (with column names):
>>> example_df.show(n=5)
+---------------------+------+-------------+---------+---------+-------------+------------+
| id| price| street| locality|town_city| district| country|
+---------------------+------+-------------+---------+---------+-------------+------------+
| bbbb1|295000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
| aaaa2|450000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
| bbbb2|280000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
| aaaa3|425000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
| cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+---------------------+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
For the same location (= same value for street|locality|town_city|district|country
), I want to create a new location_id
, and group the corresponding fields, something like this:
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
|location_id| id| price| street| locality|town_city| district| country|
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
| 0| aaaa2|450000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
| 0| aaaa3|425000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
| 1| bbbb1|295000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
| 1| bbbb2|280000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
| 2| cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
...
I know how to create the new column:
from pyspark.sql.functions import monotonically_increasing_id
df = df.WithColumn("location_id", monotonically_increasing_id())
This just create a new column with 0,1,2 etc
, but how can I group the data for the same location and also give them unique location_id
? Thanks.