
I have a dataframe (around 20,000,000 rows) and I'd like to drop duplicates from it based on two columns if those columns have the same values, even when the values appear in reverse order. For example, the original dataframe:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   1|   B|
|   2|   1|   C|
|   1|   2|   D|
|   3|   5|   E|
|   3|   4|   F|
|   4|   3|   G|
+----+----+----+

where the schema of the columns is as follows:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

The desired dataframe should look like:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   2|   D|
|   3|   5|   E|
|   3|   4|   F|
+----+----+----+

The dropDuplicates() method only removes duplicates if the values appear in the same column order.

I followed the accepted answer to the question Pandas: remove reverse duplicates from dataframe, but it took too much time.

Braiam
Ahmad Senousi

1 Answer


You can use least() and greatest() to canonicalize the pair before dropping duplicates. Hope this helps.

Note: in 'col3', 'D' will be removed instead of 'C', because the row with 'C' comes before the row with 'D'.

from pyspark.sql import functions as F

df = spark.read.csv('/FileStore/tables/stack2.csv', header=True)

# Canonicalize the pair: put the smaller value in col1 and the larger
# in col2, so (2, 1) and (1, 2) become the same key, then deduplicate.
df2 = df.select(F.least(df.col1, df.col2).alias('col1'),
                F.greatest(df.col1, df.col2).alias('col2'),
                df.col3)
df2.dropDuplicates(['col1', 'col2']).show()
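For reference, the same least/greatest canonicalization can be sketched in plain pandas (the route the question originally tried); this is only an illustration with the sample data from the question, not the accepted Spark answer itself:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['1', '1', '2', '1', '3', '3', '4'],
    'col2': ['1', '1', '1', '2', '5', '4', '3'],
    'col3': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
})

# Canonicalize the pair: smaller value to col1, larger to col2,
# so ('2', '1') and ('1', '2') collapse to the same key.
swap = df['col1'] > df['col2']
out = df.assign(
    col1=df['col1'].where(~swap, df['col2']),
    col2=df['col2'].where(~swap, df['col1']),
).drop_duplicates(['col1', 'col2'])

print(out)
# keeps rows A, C, E, F (the first occurrence of each unordered pair)
```

As with the Spark version, the first occurrence of each unordered pair is kept, so 'C' survives rather than 'D'.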
WarBoy
    Welcome to the SO and allow me to be your first upvoter :) – cph_sto Mar 14 '19 at 07:57
  • Thanks a lot for your answer. To keep the other row of the pair in col3, you can drop duplicates after selecting only col1 and col2, then join the result with the original dataframe, and then drop duplicates again across all columns to remove the identical rows introduced by the join. – Ahmad Senousi Mar 14 '19 at 08:47
  • Thank you sir, I would never have thought of this and it has come up repeatedly! – rjurney Feb 18 '21 at 03:09