How to join two data frames without messing up the order? (PySpark SQL)

Question

So I have this data frame hotel_weather_top_ten:

----------+-----+--------+----------+
|           id|month|abs_diff|row_number|
+-------------+-----+--------+----------+
| 996432412678|    8|     9.0|         1|
| 970662608897|    8|     9.0|         2|
| 730144440321|    8|     9.0|         3|
| 824633720837|    8|     9.0|         4|
| 953482739712|    8|     8.0|         5|
|1013612281861|    8|     8.0|         6|
|1288490188802|    8|     8.0|         7|
| 429496729601|    8|     8.0|         8|
| 429496729604|    8|     8.0|         9|
|1262720385026|    8|     8.0|        10|
| 146028888072|    9|    19.0|         1|
|1571958030336|    9|    19.0|         2|
| 146028888068|    9|    16.0|         3|
|  25769803779|    9|    15.0|         4|
|1099511627777|    9|    13.0|         5|
| 206158430212|    9|    13.0|         6|
|1262720385024|    9|    13.0|         7|
              ... and so on

Basically it just shows top 10 abs_diff value for every month in every hotel.

And now I wanna join it with hotel_weather_names so I have hotels names:

+-------------+--------------------+
|           id|                name|
+-------------+--------------------+
| 858993459206|   2608 E Malone Ave|
|1400159338497|     345 Chestnut St|
| 463856467968|      45400 Park Ave|
| 481036337152|  1118 Government St|
|1709396983810|165 Saint Emanuel St|
| 790273982465|1237 Us Highway 4...|
| 231928233986|       Maslinica Bay|
| 592705486849|   1223 Radford Blvd|
| 515396075520|    3750 Meridian St|
| 352187318275|      1407 E Rusk St|
|1202590842884|     2030 Formosa Rd|
| 747324309509|        602 E 4th St|
| 695784701955|10821 Caribbean Blvd|
| 644245094400|1725 Long Beach Blvd|
|1176821039105| 15902 S Western Ave|
|1365799600130|3201 E Pacific Co...|
| 309237645313|   8541 S Hampton Rd|
| 678604832769|       221 Shultz Rd|
| 755914244097|1320 Harrisburg Pike|
| 523986010113|        615 Clark Rd|
+-------------+--------------------+
only showing top 20 rows

But somehow when I do the join the result is not what I am expecting. e.g.:

|          id|month|abs_diff|row_number|             name|
+------------+-----+--------+----------+-----------------+
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
+------------+-----+--------+----------+-----------------+
only showing top 20 rows

So there is no top 10 anymore. How do I keep the order and just add names accordingly? This is how I am joining them: result = hotel_weather_top_ten.join(hotel_weather_names, on='id', how='inner')

Thank you beforehand!

score 1 · Accepted Answer · answered Dec 11 '21 at 18:54

1

You can do that with the following.

from pyspark.sql import functions

hotel_weather_top_ten = (hotel_weather_top_ten
         .withColumnRenamed("id", "id1")
         .withColumn("order", functions.monotonically_increasing_id()))

hotel_weather_top_ten = (hotel_weather_top_ten
        .join(hotel_weather_names,
              on=hotel_weather_top_ten["id1"] == hotel_weather_names["id"], 
              how="inner")
        .orderBy("order")
        .select("id","month","abs_diff","row_number","name")

answered Dec 11 '21 at 18:54

BoomBoxBoy

1,770
1
5
23

1

Hello! Thank you so much for help! Can you please explain, why do you rename `id` column? – Vlad Vlad Dec 11 '21 at 19:14
1

It just makes the column names unique for the select statement after the join. ie. There would be two columns named "id" and Pyspark wouldn't know which one to select. – BoomBoxBoy Dec 11 '21 at 19:53
1

See this question for more info -> [spark-dataframe-distinguish-columns-with-duplicated-name](https://stackoverflow.com/questions/33778664/spark-dataframe-distinguish-columns-with-duplicated-name) – BoomBoxBoy Dec 11 '21 at 19:54

Nithish · Answer 2 · 2021-12-11T19:16:45.130

1

The output you get, could be due to presence of duplicate id and name mapping in hotel_weather_names. Applying a hotel_weather_names.distinct() could be one potential solution.

edited Dec 11 '21 at 19:16

answered Dec 11 '21 at 19:08

Nithish

3,062
2
8
16

How to join two data frames without messing up the order? (PySpark SQL)

2 Answers2