I'm working with Spark 2.2.0. I have a DataFrame holding more than 20 columns. In the example below, PERIOD is a week number and TYPE is the type of store (Hypermarket or Supermarket):
table.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE| etc......
+--------------------+-------------------+-----------------+
| W1| HM|
| W2| SM|
| W3| HM|
etc...
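For reference, a tiny frame with the same shape can be built directly; this is only an illustrative sketch with made-up rows (my real table is loaded from elsewhere and has 20+ columns, including the STORE_DESC column used below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative sample only: three of the 20+ columns, with made-up values
table = spark.createDataFrame(
    [("W1", "HM", "STORE_A"),
     ("W2", "SM", "STORE_B"),
     ("W3", "HM", "STORE_A")],
    ["PERIOD", "TYPE", "STORE_DESC"],
)
table.show()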
I want to do a simple groupby (shown here in PySpark, but Scala or Spark SQL gives the same result):
from pyspark.sql.functions import countDistinct

total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...| BORGO| 1|
| C ATHIS MONS| ATHIS MONS CEDEX| 1|
| CMA BOSC LE HARD| BOSC LE HARD| 1|
The problem is not in the calculation: the columns got mixed up. PERIOD has STORE NAMES, TYPE has CITY, etc. I have no clue why. Everything else works fine.
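For what it's worth, I would expect the same behaviour when the count column is renamed with alias() inside the aggregation instead of withColumnRenamed (sketch only, assuming countDistinct is imported from pyspark.sql.functions as above):

from pyspark.sql.functions import countDistinct

# Same aggregation as above, but the rename happens directly via alias()
total_stores = table.groupby("PERIOD", "TYPE").agg(
    countDistinct("STORE_DESC").alias("NB STORES (TOTAL)")
)
total_stores.show(10)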