
I have a DataFrame (DF) like the one below. I am working with the tables as RDDs.

[screenshot of the example DataFrame]

I would like to get a table with the maximum order value for each country, along with the customer number of the customer who placed it.

I have no idea how to construct the map function. Or is there a better way?

DarrylG

1 Answer


With PySpark:

df.groupBy('customernumber', 'city').max('sum_of_orders')

With Pandas:

df.groupby(['customernumber', 'city'])['sum_of_orders'].max()
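
Note that groupBy(...).max(...) returns only the grouping keys plus the aggregated maximum. If the goal is the single top customer per city (or country), grouping by both customernumber and city gives one row per customer, not per city; the window-function approach from the question linked in the comments below keeps the whole winning row instead. A minimal sketch, assuming the column names used above (customernumber, city, sum_of_orders) and made-up sample rows:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows standing in for the screenshot; the column names
# follow the snippets above.
df = spark.createDataFrame(
    [(101, 'Paris', 2500.0),
     (102, 'Paris', 4100.0),
     (103, 'Berlin', 3200.0)],
    ['customernumber', 'city', 'sum_of_orders'])

# Rank customers within each city by order total, then keep the top row,
# so the winning customernumber survives the aggregation.
w = Window.partitionBy('city').orderBy(F.desc('sum_of_orders'))
df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn').show()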
Corralien
  • Does it work for you? – Corralien Dec 12 '21 at 22:11
  • Not completely.. :/ Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonPartitioner. Trace: py4j.Py4JException: Constructor org.apache.spark.api.python.PythonPartitioner([class java.lang.String, class java.lang.Long]) does not exist [...] py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:745) – WenDhoine Dec 12 '21 at 22:16
  • I need something like df.rdd.reduceByKey(lambda x: (some function)) – WenDhoine Dec 12 '21 at 22:18 (see the reduceByKey sketch after these comments)
  • https://stackoverflow.com/questions/33716047/pyspark-grouby-and-then-get-max-value-of-each-group may solve your problem. – Corralien Dec 12 '21 at 22:22
  • at first glance, I think so :D thanks a lot – WenDhoine Dec 12 '21 at 22:27
  • I closed your question. If this does not solve your problem, contact me here. – Corralien Dec 12 '21 at 22:34
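
For completeness, a minimal sketch of the df.rdd.reduceByKey route WenDhoine hinted at, reusing the sample df from the sketch above: key each row by city, then keep whichever (sum_of_orders, customernumber) pair has the larger total.

# Key by city, keep the pair with the larger order total for each key.
best = (df.rdd
          .map(lambda r: (r['city'], (r['sum_of_orders'], r['customernumber'])))
          .reduceByKey(lambda a, b: a if a[0] >= b[0] else b))
print(best.collect())  # e.g. [('Paris', (4100.0, 102)), ('Berlin', (3200.0, 103))], order may vary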