
I am using the groupBy function to remove duplicates from a Spark DataFrame. For each group I simply want to take the first row, which will be the most recent one.

I don't want to perform a max() aggregation because I know the results are already stored sorted in Cassandra, and I want to avoid unnecessary computation. See this approach using pandas; it's exactly what I'm after, except in Spark.

df = sqlContext.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(table="table", keyspace="keyspace") \
            .load() \
            .groupBy("key")
            # what goes here?

1 Answer


Just dropDuplicates should do the job.

Try df.dropDuplicates(["key"]).show().

Check this question for more details.
