
I have a DataFrame (DF) like the one below. I am working with the tables as RDDs.

[screenshot of the example DataFrame]

I would like to get a table with the maximum order value for each country, along with the customer number of the customer who placed it.

I have no idea how to construct the map function. Or is there a better way?

DarrylG

1 Answer


With PySpark:

df.groupBy('customernumber', 'city').max('sum_of_orders')

With Pandas:

df.groupby(['customernumber', 'city'])['sum_of_orders'].max()
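
Note that groupBy(...).max(...) returns only the grouping keys plus the aggregated maximum. If the goal is the single top customer per city (or country), grouping by both customernumber and city gives one row per customer, not per city; the window-function approach from the question linked in the comments below keeps the whole winning row instead. A minimal sketch, assuming the column names used above (customernumber, city, sum_of_orders) and made-up sample rows:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows standing in for the screenshot; the column names
# follow the snippets above.
df = spark.createDataFrame(
    [(101, 'Paris', 2500.0),
     (102, 'Paris', 4100.0),
     (103, 'Berlin', 3200.0)],
    ['customernumber', 'city', 'sum_of_orders'])

# Rank customers within each city by order total, then keep the top row,
# so the winning customernumber survives the aggregation.
w = Window.partitionBy('city').orderBy(F.desc('sum_of_orders'))
df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn').show()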
Corralien
  • Does it work for you? – Corralien Dec 12 '21 at 22:11
  • Not completely.. :/ Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonPartitioner. Trace: py4j.Py4JException: Constructor org.apache.spark.api.python.PythonPartitioner([class java.lang.String, class java.lang.Long]) does not exist [...] py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:745) – WenDhoine Dec 12 '21 at 22:16
  • I need something like df.rdd.reduceByKey(lambda x: (some function)) – WenDhoine Dec 12 '21 at 22:18 (see the reduceByKey sketch after these comments)
  • https://stackoverflow.com/questions/33716047/pyspark-grouby-and-then-get-max-value-of-each-group may solve your problem. – Corralien Dec 12 '21 at 22:22
  • at first glance, I think so :D thanks a lot – WenDhoine Dec 12 '21 at 22:27
  • I closed your question. If this does not solve your problem, contact me here. – Corralien Dec 12 '21 at 22:34
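
For completeness, a minimal sketch of the df.rdd.reduceByKey route WenDhoine hinted at, reusing the sample df from the sketch above: key each row by city, then keep whichever (sum_of_orders, customernumber) pair has the larger total.

# Key by city, keep the pair with the larger order total for each key.
best = (df.rdd
          .map(lambda r: (r['city'], (r['sum_of_orders'], r['customernumber'])))
          .reduceByKey(lambda a, b: a if a[0] >= b[0] else b))
print(best.collect())  # e.g. [('Paris', (4100.0, 102)), ('Berlin', (3200.0, 103))], order may vary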