
I'm working with Spark 2.2.0.

I have a DataFrame with more than 20 columns. In the example below, PERIOD is a week number and TYPE is a store type (Hypermarket or Supermarket).

table.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE| etc......
+--------------------+-------------------+-----------------+  
|                  W1|                 HM| 
|                  W2|                 SM|
|                  W3|                 HM|

etc...

I want to do a simple groupBy (shown here in PySpark, but Scala and Spark SQL give the same result):

from pyspark.sql.functions import countDistinct

total_stores = table.groupBy("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))

total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")

total_stores2.show(10)
+--------------------+-------------------+-----------------+
|              PERIOD|               TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...|              BORGO|                1|
|        C ATHIS MONS|   ATHIS MONS CEDEX|                1|
|    CMA BOSC LE HARD|       BOSC LE HARD|                1|

The problem is not in the calculation: the columns get mixed up. PERIOD contains store names, TYPE contains cities, and so on.

I have no clue why. Everything else works fine.
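
For reference, here is a self-contained version of the aggregation (a minimal sketch; the sample rows are made up to match the description above). Using alias() on the aggregate also avoids having to guess the auto-generated column name:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the real table, with the columns used above
table = spark.createDataFrame(
    [("W1", "HM", "STORE A"), ("W1", "HM", "STORE B"), ("W2", "SM", "STORE A")],
    ["PERIOD", "TYPE", "STORE_DESC"])

# alias() names the aggregate directly, replacing the withColumnRenamed step
total_stores = table.groupBy("PERIOD", "TYPE").agg(
    countDistinct("STORE_DESC").alias("NB STORES (TOTAL)"))
total_stores.show()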

  • groupBy returns an unordered map, so you can't rely on the sequence of the elements. – Raman Mishra May 20 '18 at 09:27
  • If you want to preserve the order, you can use a linked hash map. – Raman Mishra May 20 '18 at 09:30
  • Possible duplicate of [Scala GroupBy preserving insertion order?](https://stackoverflow.com/questions/9594431/scala-groupby-preserving-insertion-order) – Raman Mishra May 20 '18 at 10:08
  • groupBy has nothing to do with Spark; it's a Scala method you can apply to any iterable. So I don't see any difference. – Raman Mishra May 20 '18 at 10:12
  • Sorry, I was not being very clear: can you suggest, in PySpark, a few lines that would keep (or reorder) the map so that groupBy behaves like a proper SQL GROUP BY? – Arnaud B. May 20 '18 at 10:20
  • If reordering works, you can do total_stores.sortBy(key), where key is the key of your map, i.e. total_stores._1. – Raman Mishra May 20 '18 at 10:24
  • 'DataFrame' object has no attribute 'sortBy'... – Arnaud B. May 20 '18 at 10:31
  • By the way, I can understand the unordered set; what I don't get is why it keeps the column headers in place, essentially displaying a wrong result... – Arnaud B. May 20 '18 at 10:56
  • You have df.orderBy, why can't you use that to order the columns? – Raman Mishra May 20 '18 at 10:57
  • You must have done some nasty thing in the middle. – Ramesh Maharjan May 20 '18 at 12:09
  • Would you mind checking how to create a [mcve] and [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/9613318), and [edit] your question to reflect these? It is either a very unusual bug or your data is just mixed up from the beginning. – Alper t. Turker May 20 '18 at 12:45
  • Wait, orderBy orders the rows, ascending or descending. That is not at all my issue. Look at the description again: in the PERIOD column, for example (which should contain weeks of the form Wn), I find store names, which had their own column before the groupBy. – Arnaud B. May 20 '18 at 13:11
  • And no, that's the first thing I checked: the DataFrame is clean before I do the groupBy. – Arnaud B. May 20 '18 at 13:13
  • What I meant was: I do have a sort problem, but not with the rows, with the columns! See the sketch below for the row-versus-column distinction. – Arnaud B. May 20 '18 at 13:25
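
To illustrate the distinction the comments circle around, here is a minimal sketch (assuming the total_stores2 DataFrame from the question): orderBy sorts rows, while select controls which columns appear and in what order; neither moves values between columns.

# orderBy sorts the rows (ascending by default); the columns stay where they are
total_stores2.orderBy("PERIOD", "TYPE").show(10)

# select controls the column layout; it does not touch the row order
total_stores2.select("PERIOD", "TYPE", "NB STORES (TOTAL)").show(10)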

0 Answers