I have been trying to identify and eliminate duplicates from a CSV input file.
Consider the input file below. We need to find any "Fruit-Vegetable" pairs that are the same when reversed (for example, Apple/Potato and Potato/Apple).
If the other attributes (Country, City and Postcode) also match, the first occurrence is kept and the later duplicate is eliminated.
If the other attributes do not match, both rows are kept and a new "Duplicate" column is added to label them (for example, Duplicate1). Rows with no reversed counterpart get NA in that column.
I have tried several approaches, but I haven't been able to do the first step, which is to identify the duplicates.
If someone can help me with that, I should be able to proceed with the rest. Thanks!
Input:
Fruit | Vegetable | Country | City | Postcode |
---|---|---|---|---|
Apple | Potato | Australia | Sydney | 2000 |
Potato | Apple | Australia | Sydney | 2000 |
Orange | Onion | Australia | Melbourne | 3000 |
Grapes | Beans | Australia | Perth | 6000 |
Beans | Grapes | Australia | Sydney | 2000 |
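
To make the first step concrete, here is a minimal sketch of one possible way to identify reversed pairs in the input above. It is an illustration, not a known-good solution: it assumes plain Python with the rows already loaded as dictionaries, and the helper names `pair_key` and `find_reversed_pairs` are hypothetical. The idea is to sort the two values so that "Apple, Potato" and "Potato, Apple" produce the same key. Note that under this assumption an exact repeat of the same row is also reported, not only a reversed one.

```python
from collections import defaultdict


def pair_key(fruit, vegetable):
    """Order-insensitive key: ('Potato', 'Apple') and ('Apple', 'Potato') give the same tuple."""
    return tuple(sorted((fruit.strip(), vegetable.strip())))


def find_reversed_pairs(rows):
    """Map each pair key to the row indexes sharing it, keeping only keys seen more than once."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[pair_key(row["Fruit"], row["Vegetable"])].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


# The five input rows from the table above (Postcode kept as text).
rows = [
    {"Fruit": "Apple", "Vegetable": "Potato", "Country": "Australia", "City": "Sydney", "Postcode": "2000"},
    {"Fruit": "Potato", "Vegetable": "Apple", "Country": "Australia", "City": "Sydney", "Postcode": "2000"},
    {"Fruit": "Orange", "Vegetable": "Onion", "Country": "Australia", "City": "Melbourne", "Postcode": "3000"},
    {"Fruit": "Grapes", "Vegetable": "Beans", "Country": "Australia", "City": "Perth", "Postcode": "6000"},
    {"Fruit": "Beans", "Vegetable": "Grapes", "Country": "Australia", "City": "Sydney", "Postcode": "2000"},
]

candidates = find_reversed_pairs(rows)
print(candidates)
```

For the sample rows this reports indexes 0 and 1 under the key `('Apple', 'Potato')` and indexes 3 and 4 under `('Beans', 'Grapes')`. Whether each group is a true duplicate or only a labelled one still depends on comparing Country, City and Postcode.
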
Output:
Fruit | Vegetable | Country | City | Postcode | Duplicate |
---|---|---|---|---|---|
Apple | Potato | Australia | Sydney | 2000 | NA |
Orange | Onion | Australia | Melbourne | 3000 | NA |
Grapes | Beans | Australia | Perth | 6000 | Duplicate1 |
Beans | Grapes | Australia | Sydney | 2000 | Duplicate1 |
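
For reference, the rules that lead from the input to this output could be sketched roughly as follows. This is only one interpretation: it assumes the file is comma-delimited, uses only the Python standard library, and the names `SAMPLE`, `ATTRS` and `label_reversed_pairs` are made up for the example. It also assumes that labels are numbered in order of first appearance (Duplicate1, Duplicate2, and so on).

```python
import csv
import io

# Assumed comma-delimited version of the input table.
SAMPLE = """\
Fruit,Vegetable,Country,City,Postcode
Apple,Potato,Australia,Sydney,2000
Potato,Apple,Australia,Sydney,2000
Orange,Onion,Australia,Melbourne,3000
Grapes,Beans,Australia,Perth,6000
Beans,Grapes,Australia,Sydney,2000
"""

ATTRS = ("Country", "City", "Postcode")


def label_reversed_pairs(rows):
    """Drop reversed pairs whose attributes match; label those whose attributes differ."""
    seen = set()
    kept = []
    for row in rows:
        pair = tuple(sorted((row["Fruit"], row["Vegetable"])))  # order-insensitive key
        attrs = tuple(row[name] for name in ATTRS)
        if (pair, attrs) in seen:
            continue  # same pair and same attributes: keep only the first occurrence
        seen.add((pair, attrs))
        kept.append((pair, row))

    # Count how many rows survive for each pair.
    counts = {}
    for pair, _ in kept:
        counts[pair] = counts.get(pair, 0) + 1

    labels = {}
    result = []
    for pair, row in kept:
        if counts[pair] > 1:
            # Same pair but different attributes: share one numbered label.
            if pair not in labels:
                labels[pair] = f"Duplicate{len(labels) + 1}"
            label = labels[pair]
        else:
            label = "NA"
        result.append({**row, "Duplicate": label})
    return result


output_rows = label_reversed_pairs(csv.DictReader(io.StringIO(SAMPLE)))
for row in output_rows:
    print(" | ".join(row.values()))
```

On the sample data this keeps four rows: Apple/Potato and Orange/Onion with NA, and both Grapes/Beans rows with Duplicate1. To read from a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.
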
I've tried reversing the string and merging the result to find the duplicates, but they are not getting eliminated. I have also tried various similar answers on Stack Overflow, but I have not been able to find the right logic to get this working.