
I am using Spark 2, Zeppelin, and Scala to show the top 10 occurrences of words in a data set. My code:

z.show(dfFlat.groupBy("value").count().sort(desc("count")), 10)

gives a plot of the word counts, with 'cat' first. How do I ignore 'cat' and have the plot start from 'hat', i.e. show the 2nd through last elements?

I tried:

z.show(dfFlat.groupBy("value").count().sort(desc("count")).slice(2,4), 10)

but this gives:

error: value slice is not a member of org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
schoon

2 Answers


It's not straightforward to drop the first row of a DataFrame (see also Drop first row of Spark DataFrame), but you can do it using window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $"..." column syntax

val df = Seq(
  "cat", "cat", "cat", "hat", "hat", "bat"
).toDF("value")

val dfGrouped = df
  .groupBy($"value").count()
  .sort($"count".desc)

dfGrouped.show()

+-----+-----+
|value|count|
+-----+-----+
|  cat|    3|
|  hat|    2|
|  bat|    1|
+-----+-----+

val dfWithoutFirstRow = dfGrouped
  .withColumn("rank", dense_rank().over(Window.partitionBy().orderBy($"count".desc)))
  .where($"rank" =!= 1).drop($"rank") // this filters "cat"
  .sort($"count".desc)


dfWithoutFirstRow
  .show()

+-----+-----+
|value|count|
+-----+-----+
|  hat|    2|
|  bat|    1|
+-----+-----+
Raphael Roth

The first row can be removed like this. Because dfGrouped is sorted by count in descending order, its first row holds the most frequent word; read that value out and filter it away:

val filteredValue = dfGrouped.first.get(0) // the most frequent word, e.g. "cat"
val result = dfGrouped.filter(s"value != '$filteredValue'")
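As a usage sketch tying this back to the original question (assuming the dfFlat DataFrame and the Zeppelin context z from the question; names outside the question are illustrative):

```scala
import org.apache.spark.sql.functions.desc
import spark.implicits._ // for the $"..." column syntax

// Same grouping as in the question, sorted by count descending.
val dfGrouped = dfFlat.groupBy("value").count().sort(desc("count"))

// The most frequent word sits in the first row because of the descending sort.
val top = dfGrouped.first.getString(0)

// Plot everything except the top word, so the chart starts from the 2nd element.
z.show(dfGrouped.filter($"value" =!= top), 10)
```

Note this only drops the single most frequent word; to skip the top N words you would need a ranking approach like the window-function answer above.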
pasha701
  • While this code snippet may solve the problem, it doesn't explain why or how it answers the question. Please include an explanation for your code, as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion – Balagurunathan Marimuthu Sep 06 '17 at 06:24