
I'm working with Spark 2.2.0.

I have a DataFrame with more than 20 columns. In the example below, PERIOD is a week number and TYPE is a store type (Hypermarket or Supermarket).

table.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE| etc......
+--------------------+-------------------+-----------------+  
|                  W1|                 HM| 
|                  W2|                 SM|
|                  W3|                 HM|

etc...

I want to do a simple groupBy (shown here in PySpark, but Scala and Spark SQL give the same result):

from pyspark.sql.functions import countDistinct

total_stores = table.groupBy("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))

total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")

total_stores2.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...|              BORGO|                1|
|        C ATHIS MONS|   ATHIS MONS CEDEX|                1|
|    CMA BOSC LE HARD|       BOSC LE HARD|                1|

The problem is not in the calculation: the columns get mixed up. PERIOD contains store names, TYPE contains cities, and so on.

I have no clue why. Everything else works fine.
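
For reference, here is a self-contained version of the aggregation (a minimal sketch; the sample rows are made up to match the description above). Using alias() on the aggregate also avoids having to guess the auto-generated column name:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the real table, with the columns used above
table = spark.createDataFrame(
    [("W1", "HM", "STORE A"), ("W1", "HM", "STORE B"), ("W2", "SM", "STORE A")],
    ["PERIOD", "TYPE", "STORE_DESC"])

# alias() names the aggregate directly, replacing the withColumnRenamed step
total_stores = table.groupBy("PERIOD", "TYPE").agg(
    countDistinct("STORE_DESC").alias("NB STORES (TOTAL)"))
total_stores.show()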

  • groupBy returns an unordered map, so you can't rely on the sequence of the elements. – Raman Mishra May 20 '18 at 09:27
  • If you want to preserve the order, you can use a linked hash map. – Raman Mishra May 20 '18 at 09:30
  • Possible duplicate of [Scala GroupBy preserving insertion order?](https://stackoverflow.com/questions/9594431/scala-groupby-preserving-insertion-order) – Raman Mishra May 20 '18 at 10:08
  • groupBy has nothing to do with Spark; it's a Scala method you can apply to any iterable. So I don't see any difference. – Raman Mishra May 20 '18 at 10:12
  • Sorry, I was not being very clear: can you suggest, in PySpark, a few lines that would keep (or reorder) the map so that groupBy behaves like a proper SQL GROUP BY? – Arnaud B. May 20 '18 at 10:20
  • If reordering works, you can do total_stores.sortBy(key), where key is the key of your map, i.e. total_stores._1. – Raman Mishra May 20 '18 at 10:24
  • 'DataFrame' object has no attribute 'sortBy'... – Arnaud B. May 20 '18 at 10:31
  • By the way, I can understand the unordered set; what I don't get is why it keeps the column headers in place, essentially displaying a wrong result... – Arnaud B. May 20 '18 at 10:56
  • You have df.orderBy, why can't you use that to order the columns? – Raman Mishra May 20 '18 at 10:57
  • You must have done some nasty thing in the middle. – Ramesh Maharjan May 20 '18 at 12:09
  • Would you mind checking how to create a [mcve] and [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/9613318), and [edit] your question to reflect these? It is either a very unusual bug or your data is just mixed up from the beginning. – Alper t. Turker May 20 '18 at 12:45
  • Wait, orderBy orders the rows, ascending or descending. That is not at all my issue. Look at the description again: in the PERIOD column, for example (which should contain weeks of the form Wn), I find store names, which had their own column before the groupBy. – Arnaud B. May 20 '18 at 13:11
  • And no, that's the first thing I checked: the DataFrame is clean before I do the groupBy. – Arnaud B. May 20 '18 at 13:13
  • What I meant was: I do have a sort problem, but not with the rows, with the columns! See the sketch below for the row-versus-column distinction. – Arnaud B. May 20 '18 at 13:25
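
To illustrate the distinction the comments circle around, here is a minimal sketch (assuming the total_stores2 DataFrame from the question): orderBy sorts rows, while select controls which columns appear and in what order; neither moves values between columns.

# orderBy sorts the rows (ascending by default); the columns stay where they are
total_stores2.orderBy("PERIOD", "TYPE").show(10)

# select controls the column layout; it does not touch the row order
total_stores2.select("PERIOD", "TYPE", "NB STORES (TOTAL)").show(10)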

0 Answers