I need to display the top 5 states and cities based on the total number of reviews (review_count in the original schema mentioned below). The description of my DataFrame (loaded from a JSON file) is given below.

+-------------+--------------------+-------+
|     col_name|           data_type|comment|
+-------------+--------------------+-------+
|   attributes|struct<Accepts Cr...|   null|
|         city|              string|   null|
|neighborhoods|       array<string>|   null|
|         open|             boolean|   null|
| review_count|              bigint|   null|
|        stars|              double|   null|
|        state|              string|   null|
|         type|              string|   null|
+-------------+--------------------+-------+

I tried ORDER BY approaches, but they did not work. I finally got to know about window functions.

In the code that I wrote, the value of review_count is not the exact value that appears in the JSON file.

The code that I tried is:

val topcity = spark.sql("select city, state, review_count, RANK() OVER (ORDER BY review_count desc) AS RANKING from yelp").show(5)

The following is the output that I am getting:

+-------------+-----+------------+-------+
|         city|state|review_count|RANKING|
+-------------+-----+------------+-------+
|   Pittsburgh|   PA|           3|      1|
|     Carnegie|   PA|           3|      2|
|     Carnegie|   PA|           3|      3|
|     Carnegie|   PA|           3|      4|
|   Pittsburgh|   PA|           3|      5|
+-------------+-----+------------+-------+

So my review count is a constant value of 3. My questions are:

  1. Why is the review count constantly 3?
  2. What changes should I make to get the top 5 exact values of review_count?

1 Answer

The following is an implementation, assuming that you are looking for the total number of reviews for each state-city combination (hopefully I got that right):

First we generate some dummy data:

cities_data = [
    ["Alameda", "California", 1],
    ["Alameda", "California", 3],
    ["Berkeley", "California", 2],
    ["Beverly Hills", "California", 2],
    ["Beverly Hills", "California", 3],
    ["Hollywood", "California", 4],
    ["Miami", "Florida", 3],
    ["Miami", "Florida", 2],
    ["Orlando", "Florida", 1],
    ["Cocoa Beach", "Florida", 1],
]

cols = ["city", "state", "review_count"]
df = spark.createDataFrame(cities_data, cols)
df.show(10, False)

This will print:

+-------------+----------+------------+
|city         |state     |review_count|
+-------------+----------+------------+
|Alameda      |California|1           |
|Alameda      |California|3           |
|Berkeley     |California|2           |
|Beverly Hills|California|2           |
|Beverly Hills|California|3           |
|Hollywood    |California|4           |
|Miami        |Florida   |3           |
|Miami        |Florida   |2           |
|Orlando      |Florida   |1           |
|Cocoa Beach  |Florida   |1           |
+-------------+----------+------------+

The data is grouped by state/city in order to get the sum of reviews per combination. This is PySpark, but it should be very easy to change to Scala:

from pyspark.sql import functions as F

# Sum the reviews per state/city combination, then keep the 5 largest totals.
df = df.groupBy("state", "city") \
    .agg(F.sum("review_count").alias("reviews_count")) \
    .orderBy(F.desc("reviews_count")) \
    .limit(5)

df.show(5, False)

And this should be the output for the scenario above:

+----------+-------------+-------------+
|state     |city         |reviews_count|
+----------+-------------+-------------+
|California|Beverly Hills|5            |
|Florida   |Miami        |5            |
|California|Alameda      |4            |
|California|Hollywood    |4            |
|California|Berkeley     |2            |
+----------+-------------+-------------+
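Since the question is in Scala, here is a sketch of the same aggregation ported to Scala. It assumes the Yelp data is in a DataFrame named df and, for the SQL variant, registered as the yelp table used in the question; the val names are just illustrative:

import org.apache.spark.sql.functions.{sum, desc}

// Sum the reviews per state/city combination and keep the 5 largest totals.
val topCities = df
  .groupBy("state", "city")
  .agg(sum("review_count").alias("reviews_count"))
  .orderBy(desc("reviews_count"))
  .limit(5)

topCities.show(5, false)

// The same result with Spark SQL, against the yelp table from the question:
spark.sql("""
  SELECT state, city, SUM(review_count) AS reviews_count
  FROM yelp
  GROUP BY state, city
  ORDER BY reviews_count DESC
  LIMIT 5
""").show(5, false)

This also addresses the second question: aggregate with SUM and GROUP BY first, then order and limit, instead of ranking the raw rows.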