
I have loaded a CSV file into a Spark DataFrame, and when I call the approxQuantile method on it I get an error. I have tried different datasets, columns, probabilities, and relativeError values. Help me understand what's going on.

df.approxQuantile("column_name", [0.2, 0.3, 0.6, 1.0], 0)

I am getting the following error:

py4j.protocol.Py4JError: An error occurred while calling o30.approxQuantile. Trace:
py4j.Py4JException: Method approxQuantile([class scala.collection.immutable.$colon$colon, class scala.collection.immutable.$colon$colon, class java.lang.Double]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

  • What's your data type (`df.printSchema()`)? – MaFF Sep 19 '17 at 12:34
  • All columns are of type "integer":
    root
     |-- j: integer (nullable = true)
     |-- b: integer (nullable = true)
     |-- f: integer (nullable = true)
     |-- l: integer (nullable = true)
     |-- e: integer (nullable = true)
     |-- c: integer (nullable = true)
     |-- g: integer (nullable = true)
     |-- h: integer (nullable = true)
     |-- m: integer (nullable = true)
     |-- a: integer (nullable = true)
     |-- k: integer (nullable = true)
     |-- d: integer (nullable = true)
     |-- i: integer (nullable = true)
    – Sunil Rao Sep 20 '17 at 05:46

1 Answer


This can happen if your pyspark driver is using Spark 2.2.0 while your Spark cluster is running 2.1.1 (or earlier). In 2.2 the Python wrapper wraps the column argument in a list so it can support multiple columns, which makes it call a JVM method whose signature takes Scala Lists (those are the scala.collection.immutable.$colon$colon classes in your trace), and that signature does not exist on a pre-2.2 cluster. Ensure that your driver and cluster versions match and you should be good to go!

See the note in the docs about a change to the interface for approxQuantile in 2.2:

Changed in version 2.2: Added support for multiple columns.
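For reference, here is a minimal sketch of a working call once the versions agree. The DataFrame and column names below are illustrative, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quantile-example").getOrCreate()

# Illustrative DataFrame with one integer column "a"
df = spark.createDataFrame([(i,) for i in range(100)], ["a"])

# A single column name works on both 2.1 and 2.2; a small nonzero
# relativeError is much cheaper than 0, which forces an exact computation
print(df.approxQuantile("a", [0.2, 0.3, 0.6, 1.0], 0.01))

# Passing a list of columns requires Spark >= 2.2 on both driver and cluster
df2 = df.withColumn("b", df["a"] * 2)
print(df2.approxQuantile(["a", "b"], [0.5], 0.01))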

  • How do I fix this error? And how do I find out the versions of the pyspark driver and the Spark cluster? – Sunil Rao Oct 16 '17 at 09:26
  • To fix, find out whether your driver or your cluster is running the older version, then upgrade that component to match the one the other is running (probably by downloading from Spark's website). See here to determine your Spark version: https://stackoverflow.com/questions/38586834/how-to-check-spark-version . It also looks like you're using pyspark; you can see your pyspark version with this: pip freeze | grep spark – Tom Q. Oct 16 '17 at 18:08
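A quick sketch of checking both versions from inside a PySpark session (this assumes pyspark.__version__ is available, which it is on recent 2.x releases):

from pyspark.sql import SparkSession
import pyspark

spark = SparkSession.builder.getOrCreate()
print(spark.version)        # Spark version reported by the JVM side
print(pyspark.__version__)  # version of the installed pyspark driver package

If the two differ, that mismatch is the likely culprit.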