spark program to find the city with maximum population

Question

Input files contains rows like below (state,city,population):

west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000

I have to write a Spark (Apache Spark) program in both Python and Scala to find the city with maximum population. Output will be like this:

west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,new delhi,100000

So I need a three column output for each state. It's easy for me to get the output like this:

west bengal,15000
karnataka,200000
delhi,100000

But to get the city having maximum population is getting difficult.

It's always helpful if you include what you've already tried (including code snippets) in your question. — Jerod Johnson, Nov 14 '16 at 14:40

score 1 · Accepted Answer · answered Nov 15 '16 at 08:50

In vanilla pyspark, map your data to a pair RDD where the state is the key, and the value is the tuple (city, population). Then reduceByKey to keep the largest city. Beware, in the case of cities with the same population it will keep the first one it encounters.

rdd.map(lambda reg: (reg[0],[reg[1],reg[2]])) .reduceByKey(lambda v1,v2: ( v1 if v1[1] >= v2[1] else v2))

The results with your data look like this:

[('delhi', ['gurgaon', 200000]), ('west bengal', ['kolkata', 150000]), ('karnataka', ['bangalore', 200000])]

score 0 · Answer 2 · answered Nov 14 '16 at 14:51

This should do the trick:

>>> sc = spark.sparkContext
>>> rdd = sc.parallelize([
    ['west bengal','kolkata',150000],
    ['karnataka','bangalore',200000],
    ['karnataka','mangalore',80000],
    ['west bengal','bongaon',50000],
    ['delhi','new delhi',100000],
    ['delhi','gurgaon',200000],
])

>>> df = rdd.toDF(['state','city','population'])
>>> df.show()
+-----------+---------+----------+
|      state|     city|population|
+-----------+---------+----------+
|west bengal|  kolkata|    150000|
|  karnataka|bangalore|    200000|
|  karnataka|mangalore|     80000|
|west bengal|  bongaon|     50000|
|      delhi|new delhi|    100000|
|      delhi|  gurgaon|    200000|
+-----------+---------+----------+


>>> df.groupBy('city').max('population').show()
+---------+---------------+
|     city|max(population)|
+---------+---------------+
|bangalore|         200000|
|  kolkata|         150000|
|  gurgaon|         200000|
|mangalore|          80000|
|new delhi|         100000|
|  bongaon|          50000|
+---------+---------------+

>>> df.groupBy('state').max('population').show()
+-----------+---------------+
|      state|max(population)|
+-----------+---------------+
|      delhi|         200000|
|west bengal|         150000|
|  karnataka|         200000|
+-----------+---------------+

Actually I wanted three columns in each row of the output having the state, name of the city having max population and the corresponding population. — Mrinal, Nov 14 '16 at 17:36

spark program to find the city with maximum population

2 Answers2