
I have two tables, df and asm: df is a very large table, while asm is a very small one.
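For context, here is a minimal sketch of a setup like mine (the names, sizes, and columns are illustrative rather than my actual data, and it assumes a SparkSession bound to spark):

        import pyspark.sql.functions as f

        # Hypothetical stand-ins: a large DataFrame spread across many
        # partitions, and a tiny lookup table keyed by "index".
        df = spark.range(10000).withColumn("join_asm", f.col("id") % 10).repartition(200)
        asm = (spark.range(10).withColumnRenamed("id", "index")
                    .withColumn("name", f.col("index").cast("string")))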

When I execute this code:

        df.groupBy(f.spark_partition_id()).count().show()

        # the assumption name is stored under asm's "index" column
        df = df.join(f.broadcast(asm), df["join_asm"] == asm["index"]).drop("join_asm", "index")
        df.groupBy(f.spark_partition_id()).count().show()

Inexplicably, I get this result:

+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
|                 148|   50|
|                  31|   50|
|                 137|   50|
|                  85|   50|
|                  65|   50|
|                  53|   50|
|                 133|   50|
|                  78|   50|
|                 108|   50|
|                 155|   50|
|                  34|   50|
|                 193|   50|
|                 101|   50|
|                 115|   50|
|                 126|   50|
|                  81|   50|
|                  28|   50|
|                 183|   50|
|                  76|   50|
|                  26|   50|
+--------------------+-----+
only showing top 20 rows

+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
|                   0|10000|
+--------------------+-----+

Nothing I've seen explains this. I've verified that asm is being broadcast, but even if it weren't, I don't see why the join would automatically coalesce all the data into a single partition.
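One way to confirm the broadcast is to inspect the physical plan (a sketch, run against the pre-join df; the exact operator names vary across Spark versions, but a broadcast join should show up as BroadcastHashJoin with a BroadcastExchange on the asm side):

        # Print the physical plan and look for BroadcastHashJoin /
        # BroadcastExchange rather than SortMergeJoin.
        df.join(f.broadcast(asm), df["join_asm"] == asm["index"]).explain()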

  • There are some posts here; maybe they can help with partitioning behaviour during a join: https://stackoverflow.com/questions/46773829/which-properties-has-spark-dataframe-after-join-of-two-dataframes-with-same-part, https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join – Azhar Khan Oct 25 '22 at 11:33
