I have dataframe contain (around 20000000 rows) and I'd like to drop duplicates from a dataframe for two columns if those columns have the same values, or even if those values are in the reverse order. For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the column as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates()
method remove duplicates if the values in the same order
I followed the accepted answer to this question Pandas: remove reverse duplicates from dataframe but it took more time.