I need to display the Top 5 states and cities based on total number of reviews( review count in original schema mentioned below). The description of my DF (from Json file) is given below.
+-------------+--------------------+-------+
| col_name| data_type|comment|
+-------------+--------------------+-------+
| attributes|struct<Accepts Cr...| null|
| city| string| null|
|neighborhoods| array<string>| null|
| open| boolean| null|
| review_count| bigint| null|
| stars| double| null|
| state| string| null|
| type| string| null|
+-------------+--------------------+-------+
I tried like order by methods but did not work. Finally got to know about the window Function here
In the code that I wrote the value of review count is not the exact value as it is there in the Json file.
The code that I tried is:
val topcity=spark.sql("select city,state,review_count,RANK() OVER (ORDER BY review_count desc ) AS RANKING from yelp").show(5)
The following is the output that I am getting:
+-------------+-----+------------+-------+
| city|state|review_count|RANKING|
+-------------+-----+------------+-------+
| Pittsburgh| PA| 3| 1|
| Carnegie| PA| 3| 2|
| Carnegie| PA| 3| 3|
| Carnegie| PA| 3| 4|
| Pittsburgh| PA| 3| 5|
+-------------+--------------------+-----+
So My review count is only constant value of 3. So my questions are:
- Why the review count is constantly 3?
- What changes should I make to get the top 5 exact values of review count?