I have a table loaded into a DataFrame, and I tried to use groupBy with the primary keys:
# Load the full table into a DataFrame
df_remitInsert = spark.sql("""SELECT * FROM trac_analytics.mainremitdata""")
# Key combinations that occur more than once (only the three key columns survive the groupBy)
df_remitInsert_filter = df_remitInsert.groupBy("LoanID_Serv", "LoanNumber", "Month").count().filter("count > 1").drop("count")
where "LoanID_Serv", "LoanNumber", and "Month" are my primary keys.
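As far as I can tell, this groupBy keeps only the three key columns, so to recover the full rows for the duplicated keys I would presumably have to join back to the original table. A sketch of what I mean, assuming a plain inner join on the key columns (df_dup_rows is just an illustrative name):

# Sketch: full rows whose key combination appears more than once,
# obtained by joining the duplicated keys back to the original table
df_dup_rows = df_remitInsert.join(
    df_remitInsert_filter,
    on=["LoanID_Serv", "LoanNumber", "Month"],
    how="inner",
)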
What I actually want, though, is the entire data from df_remitInsert, deduplicated with respect to these primary keys.
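From what I understand, dropDuplicates with a subset of columns should give exactly that. A minimal sketch (note that which row survives for each key is arbitrary unless an explicit ordering is imposed, e.g. via a Window with row_number):

# Sketch: keep one row per primary-key combination; the surviving row is arbitrary
df_remitInsert_dedup = df_remitInsert.dropDuplicates(["LoanID_Serv", "LoanNumber", "Month"])

Is this the right approach, or is there a better way to deduplicate on the primary keys while keeping all the other columns?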