I have a table loaded into a DataFrame, and I tried to use groupBy with the primary keys:
# Load the full table into a DataFrame
df_remitInsert = spark.sql("""SELECT * FROM trac_analytics.mainremitdata""")
# Key combinations that occur more than once (only the three key columns survive the groupBy)
df_remitInsert_filter = df_remitInsert.groupBy("LoanID_Serv", "LoanNumber", "Month").count().filter("count > 1").drop("count")
where "LoanID_Serv", "LoanNumber", and "Month" are my primary keys.
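As far as I can tell, this groupBy keeps only the three key columns, so to recover the full rows for the duplicated keys I would presumably have to join back to the original table. A sketch of what I mean, assuming a plain inner join on the key columns (df_dup_rows is just an illustrative name):

# Sketch: full rows whose key combination appears more than once,
# obtained by joining the duplicated keys back to the original table
df_dup_rows = df_remitInsert.join(
    df_remitInsert_filter,
    on=["LoanID_Serv", "LoanNumber", "Month"],
    how="inner",
)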
What I actually want, though, is the entire data from df_remitInsert, deduplicated with respect to these primary keys.
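From what I understand, dropDuplicates with a subset of columns should give exactly that. A minimal sketch (note that which row survives for each key is arbitrary unless an explicit ordering is imposed, e.g. via a Window with row_number):

# Sketch: keep one row per primary-key combination; the surviving row is arbitrary
df_remitInsert_dedup = df_remitInsert.dropDuplicates(["LoanID_Serv", "LoanNumber", "Month"])

Is this the right approach, or is there a better way to deduplicate on the primary keys while keeping all the other columns?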