
I am trying to pivot a column that has more than 10,000 distinct values. Spark's default limit on the number of distinct values in a pivot column is 10,000, and I am receiving this error:

The pivot column COLUMN_NUM_2 has more than 10000 distinct values, this could indicate an error. If this was intended, set spark.sql.pivotMaxValues to at least the number of distinct values of the pivot column

How do I set this in PySpark?

j1897

1 Answer

You have to set this parameter in the Spark interpreter configuration.

I am working with Zeppelin notebooks on an AWS EMR cluster, had the same error message as you, and it worked after I added the parameter in the interpreter settings.

Hope this helps...

AlexBerlin
  • I solved the problem by setting it before starting the Spark cluster. The KEY is spark.sql.pivotMaxValues and I set the VALUE to 100000 – j1897 Mar 22 '17 at 21:05
  • Can you elaborate on how you set that parameter value? Is it during the spark context call? etc –  Nov 20 '19 at 23:28
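Following up on the comments: one way to set the property before the application starts (rather than during the Spark context call) is to pass it via `--conf` on submission. A sketch, assuming a job script named `my_pivot_job.py`:

```shell
# Hypothetical submission; the property is applied before any pivot runs.
spark-submit --conf spark.sql.pivotMaxValues=100000 my_pivot_job.py
```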