0

So there is a dataset of 10 columns with thousands of rows. Some rows have the same values in differents columns and the goal is to drop the rows and keep only the first one from that equivalence.

For the equivalence, it must be between 8 columns by pair:

IP_Src_x = IP_Src_y
IP_Dst_x = IP_Dst_y
Port_Src_x = Port_Src_y
Port_Dst_x = Port_Dst_y

Here is an example:

IP_Src_x IP_Dst_x Port_Src_x Port_Dst_x Length_x IP_Src_y IP_Dst_y Port_Src_y Port_Dst_y length_y
10.00.000.00 90.00.000.00 1000 3000 33 90.00.000.00 10.00.000.00 3000 1000 75
60.50.500.00 30.000.300.00 8000 2000 88 30.000.300.00 60.50.500.00 2000 8000 68
90.00.000.00 10.00.000.00 3000 1000 75 10.00.000.00 90.00.000.00 1000 3000 33
66.00.000.00 10.00.000.00 5000 7000 665 10.00.000.00 66.00.000.00 7000 5000 18
88.80.000.00 40.00.400.40 1120 1860 75 40.00.400.40 88.80.000.00 1860 1120 180
10.00.000.00 66.00.000.00 7000 5000 18 66.00.000.00 10.00.000.00 5000 7000 665
30.00.300.00 60.50.000.00 2000 8000 68 30.00.300.00 60.50.500.00 8000 2000 88
40.00.400.40 88.80.000.00 1860 1120 180 88.80.000.00 40.00.400.40 1120 1860 75
20.60.700.40 10.80.000.00 2200 9900 60 NULL NULL NULL NULL NULL

Knowing that this dataset is already a result of a Grouby, so no more than 2 rows will meet the equivalence of a single row, it will be only pairs.

In our example, we can see that Row 1 (starting the counting from 1 and not 0) match Row 3, the row 2 match row 7, the row 3 match row 1, the row 4 match row 6....

By droping the second row found that let only the first one, we will have

IP_Src_X IP_Dst_X Port_Src_X Port_Dst_X Length_x IP_Src_Y IP_Dst_Y Port_Src_Y Port_Dst_Y length_y
10.00.000.00 90.00.000.00 1000 3000 33 90.00.000.00 10.00.000.00 3000 1000 75
60.50.500.00 30.000.300.00 8000 2000 88 30.000.300.00 60.50.500.00 2000 8000 68
66.00.000.00 10.00.000.00 5000 7000 665 10.00.000.00 66.00.000.00 7000 5000 18
88.80.000.00 40.00.400.40 1120 1860 75 40.00.400.40 88.80.000.00 1860 1120 180
20.60.700.40 10.80.000.00 2200 9900 60 NULL NULL NULL NULL NULL

The matching rows can be in any position.

Thank you :)

0 Answers0