
I am using Spark 2, Zeppelin, and Scala to show the top 10 occurrences of words in a data set. My code:

z.show(dfFlat.groupBy("value").count().sort(desc("count")), 10)

gives a plot of the word counts, with 'cat' first. How do I ignore 'cat' and have the plot start from 'hat', i.e. show the 2nd through last elements?

I tried:

z.show(dfFlat.groupBy("value").count().sort(desc("count")).slice(2,4), 10)

but this gives:

error: value slice is not a member of org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
schoon

2 Answers


It's not straightforward to drop the first row of a DataFrame (see also Drop first row of Spark DataFrame), but you can do it using window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $"..." column syntax

val df = Seq(
  "cat", "cat", "cat", "hat", "hat", "bat"
).toDF("value")

val dfGrouped = df
  .groupBy($"value").count()
  .sort($"count".desc)

dfGrouped.show()

+-----+-----+
|value|count|
+-----+-----+
|  cat|    3|
|  hat|    2|
|  bat|    1|
+-----+-----+

val dfWithoutFirstRow = dfGrouped
  .withColumn("rank", dense_rank().over(Window.partitionBy().orderBy($"count".desc)))
  .where($"rank" =!= 1).drop($"rank") // this filters "cat"
  .sort($"count".desc)


dfWithoutFirstRow
  .show()

+-----+-----+
|value|count|
+-----+-----+
|  hat|    2|
|  bat|    1|
+-----+-----+
Raphael Roth

The first row can be removed like this. Because dfGrouped is sorted by count in descending order, its first row holds the most frequent word; read that value out and filter it away:

val filteredValue = dfGrouped.first.get(0) // the most frequent word, e.g. "cat"
val result = dfGrouped.filter(s"value != '$filteredValue'")
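As a usage sketch tying this back to the original question (assuming the dfFlat DataFrame and the Zeppelin context z from the question; names outside the question are illustrative):

```scala
import org.apache.spark.sql.functions.desc
import spark.implicits._ // for the $"..." column syntax

// Same grouping as in the question, sorted by count descending.
val dfGrouped = dfFlat.groupBy("value").count().sort(desc("count"))

// The most frequent word sits in the first row because of the descending sort.
val top = dfGrouped.first.getString(0)

// Plot everything except the top word, so the chart starts from the 2nd element.
z.show(dfGrouped.filter($"value" =!= top), 10)
```

Note this only drops the single most frequent word; to skip the top N words you would need a ranking approach like the window-function answer above.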
pasha701
  • While this code snippet may solve the problem, it doesn't explain why or how it answers the question. Please include an explanation for your code, as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion – Balagurunathan Marimuthu Sep 06 '17 at 06:24