I have two tables, df and asm. df is a very large table, while asm is a very small one.
When I execute this code:
import pyspark.sql.functions as f

df.groupBy(f.spark_partition_id()).count().show()
# asm's join key is in its "index" column
df = df.join(f.broadcast(asm), df["join_asm"] == asm["index"]).drop("join_asm", "index")
df.groupBy(f.spark_partition_id()).count().show()
I inexplicably get this result:
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 148| 50|
| 31| 50|
| 137| 50|
| 85| 50|
| 65| 50|
| 53| 50|
| 133| 50|
| 78| 50|
| 108| 50|
| 155| 50|
| 34| 50|
| 193| 50|
| 101| 50|
| 115| 50|
| 126| 50|
| 81| 50|
| 28| 50|
| 183| 50|
| 76| 50|
| 26| 50|
+--------------------+-----+
only showing top 20 rows
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|10000|
+--------------------+-----+
Nothing I've seen explains this. I've verified that asm is being broadcast (the physical plan shows a BroadcastHashJoin), but even if it weren't, I don't see why the join would automatically coalesce all the data into a single partition.