-3

Input files contains rows like below (state,city,population):

west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000

I have to write a Spark (Apache Spark) program in both Python and Scala to find the city with maximum population. Output will be like this:

west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,new delhi,100000

So I need a three column output for each state. It's easy for me to get the output like this:

west bengal,15000
karnataka,200000
delhi,100000

But to get the city having maximum population is getting difficult.

Community
  • 1
  • 1
Mrinal
  • 1,826
  • 2
  • 19
  • 31

2 Answers2

1

In vanilla pyspark, map your data to a pair RDD where the state is the key, and the value is the tuple (city, population). Then reduceByKey to keep the largest city. Beware, in the case of cities with the same population it will keep the first one it encounters.

rdd.map(lambda reg: (reg[0],[reg[1],reg[2]])) .reduceByKey(lambda v1,v2: ( v1 if v1[1] >= v2[1] else v2))

The results with your data look like this:

[('delhi', ['gurgaon', 200000]), ('west bengal', ['kolkata', 150000]), ('karnataka', ['bangalore', 200000])]

Manu Valdés
  • 2,343
  • 1
  • 19
  • 28
0

This should do the trick:

>>> sc = spark.sparkContext
>>> rdd = sc.parallelize([
    ['west bengal','kolkata',150000],
    ['karnataka','bangalore',200000],
    ['karnataka','mangalore',80000],
    ['west bengal','bongaon',50000],
    ['delhi','new delhi',100000],
    ['delhi','gurgaon',200000],
])

>>> df = rdd.toDF(['state','city','population'])
>>> df.show()
+-----------+---------+----------+
|      state|     city|population|
+-----------+---------+----------+
|west bengal|  kolkata|    150000|
|  karnataka|bangalore|    200000|
|  karnataka|mangalore|     80000|
|west bengal|  bongaon|     50000|
|      delhi|new delhi|    100000|
|      delhi|  gurgaon|    200000|
+-----------+---------+----------+


>>> df.groupBy('city').max('population').show()
+---------+---------------+
|     city|max(population)|
+---------+---------------+
|bangalore|         200000|
|  kolkata|         150000|
|  gurgaon|         200000|
|mangalore|          80000|
|new delhi|         100000|
|  bongaon|          50000|
+---------+---------------+

>>> df.groupBy('state').max('population').show()
+-----------+---------------+
|      state|max(population)|
+-----------+---------------+
|      delhi|         200000|
|west bengal|         150000|
|  karnataka|         200000|
+-----------+---------------+
Fokko Driesprong
  • 2,075
  • 19
  • 31
  • Actually I wanted three columns in each row of the output having the state, name of the city having max population and the corresponding population. – Mrinal Nov 14 '16 at 17:36