The input file contains rows like the ones below (state,city,population):
west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000
I have to write an Apache Spark program in both Python and Scala to find the city with the maximum population in each state. The output should look like this:
west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,gurgaon,200000
So I need a three-column output row for each state. It's easy for me to get two-column output like this:
west bengal,150000
karnataka,200000
delhi,200000
But getting the city that has the maximum population along with it is proving difficult.
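
To make the difficulty concrete, here is a minimal PySpark sketch of one possible approach, not a definitive solution: keep the (city, population) pair together as the value and reduce by state, so the city name travels with the maximum. The helper names (`parse_line`, `larger_city`, `max_city_per_state`, `run_with_spark`) and the `path` argument are hypothetical, and the sketch assumes every line has exactly three comma-separated fields with an integer population.

```python
def parse_line(line):
    """Turn 'state,city,population' into (state, (city, population))."""
    state, city, population = line.split(",")
    return state.strip(), (city.strip(), int(population))


def larger_city(a, b):
    """Of two (city, population) pairs, keep the more populous one.

    On a tie the first argument is kept.
    """
    return a if a[1] >= b[1] else b


def max_city_per_state(lines_rdd):
    """Assumes lines_rdd is an RDD of text lines, e.g. from sc.textFile."""
    return (
        lines_rdd.map(parse_line)          # (state, (city, population))
        .reduceByKey(larger_city)          # one (city, population) per state
        .map(lambda kv: "{},{},{}".format(kv[0], kv[1][0], kv[1][1]))
    )


def run_with_spark(path):
    """Hypothetical driver; needs pyspark installed, `path` is a placeholder."""
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    return max_city_per_state(sc.textFile(path)).collect()
```

Calling `run_with_spark` with the location of the input file should return one "state,city,population" string per state. The same idea carries over to Scala by mapping each row to a `(state, (city, population))` tuple before `reduceByKey`.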