So there is a dataset of 10 columns with thousands of rows. Some rows have the same values in differents columns and the goal is to drop the rows and keep only the first one from that equivalence.
For the equivalence, it must be between 8 columns by pair:
IP_Src_x = IP_Src_y
IP_Dst_x = IP_Dst_y
Port_Src_x = Port_Src_y
Port_Dst_x = Port_Dst_y
Here is an example:
IP_Src_x | IP_Dst_x | Port_Src_x | Port_Dst_x | Length_x | IP_Src_y | IP_Dst_y | Port_Src_y | Port_Dst_y | length_y |
---|---|---|---|---|---|---|---|---|---|
10.00.000.00 | 90.00.000.00 | 1000 | 3000 | 33 | 90.00.000.00 | 10.00.000.00 | 3000 | 1000 | 75 |
60.50.500.00 | 30.000.300.00 | 8000 | 2000 | 88 | 30.000.300.00 | 60.50.500.00 | 2000 | 8000 | 68 |
90.00.000.00 | 10.00.000.00 | 3000 | 1000 | 75 | 10.00.000.00 | 90.00.000.00 | 1000 | 3000 | 33 |
66.00.000.00 | 10.00.000.00 | 5000 | 7000 | 665 | 10.00.000.00 | 66.00.000.00 | 7000 | 5000 | 18 |
88.80.000.00 | 40.00.400.40 | 1120 | 1860 | 75 | 40.00.400.40 | 88.80.000.00 | 1860 | 1120 | 180 |
10.00.000.00 | 66.00.000.00 | 7000 | 5000 | 18 | 66.00.000.00 | 10.00.000.00 | 5000 | 7000 | 665 |
30.00.300.00 | 60.50.000.00 | 2000 | 8000 | 68 | 30.00.300.00 | 60.50.500.00 | 8000 | 2000 | 88 |
40.00.400.40 | 88.80.000.00 | 1860 | 1120 | 180 | 88.80.000.00 | 40.00.400.40 | 1120 | 1860 | 75 |
20.60.700.40 | 10.80.000.00 | 2200 | 9900 | 60 | NULL | NULL | NULL | NULL | NULL |
Knowing that this dataset is already a result of a Grouby, so no more than 2 rows will meet the equivalence of a single row, it will be only pairs.
In our example, we can see that Row 1 (starting the counting from 1 and not 0) match Row 3, the row 2 match row 7, the row 3 match row 1, the row 4 match row 6....
By droping the second row found that let only the first one, we will have
IP_Src_X | IP_Dst_X | Port_Src_X | Port_Dst_X | Length_x | IP_Src_Y | IP_Dst_Y | Port_Src_Y | Port_Dst_Y | length_y |
---|---|---|---|---|---|---|---|---|---|
10.00.000.00 | 90.00.000.00 | 1000 | 3000 | 33 | 90.00.000.00 | 10.00.000.00 | 3000 | 1000 | 75 |
60.50.500.00 | 30.000.300.00 | 8000 | 2000 | 88 | 30.000.300.00 | 60.50.500.00 | 2000 | 8000 | 68 |
66.00.000.00 | 10.00.000.00 | 5000 | 7000 | 665 | 10.00.000.00 | 66.00.000.00 | 7000 | 5000 | 18 |
88.80.000.00 | 40.00.400.40 | 1120 | 1860 | 75 | 40.00.400.40 | 88.80.000.00 | 1860 | 1120 | 180 |
20.60.700.40 | 10.80.000.00 | 2200 | 9900 | 60 | NULL | NULL | NULL | NULL | NULL |
The matching rows can be in any position.
Thank you :)