I'm back again with another awk
question.
I have multiple large files that contain data i want to dedupe against each other. Lets say I have the following data for one month:
fruit number rand
apple 12 342
taco 19 264
tortilla 2234 53423
tortillas 2 3431
apricot 13221 23424
apricots 24234 3252
pineapple 2342 2342
radish 1 3
The following month I receive this data:
fruit number rand
pineapple 2 698
apple 34 472
taco 19 234
tortilla 16 58
tortillas 87 25
potato 234 2342
radish 1 55
grapes 9 572 422
apricot 13221 24
What I am trying to do is take the second file, and check the values of the first column to see if there are items that exist in the first file. If yes, I want to remove them from the second file, leaving only items that are unique to the second file with relation to the first one.
The desired outcome would leave me something like this:
fruit number rand DUPLICATE
pineapple 2 698 DUPE
apple 34 472 DUPE
taco 19 234 DUPE
tortilla 16 58 DUPE
tortillas 87 25 DUPE
potato 234 2342
radish 1 55 DUPE
grapes 9 572 422
apricot 13221 24 DUPE
Or, more clearly:
fruit number rand
potato 234 2342
grapes 9 572 422
I was trying to think of a way to do this without having to sort the files. I was trying to modify the answer from @karafka for a related question. Instead of passing the same file twice, I tried inputting the two different files. Obviously I'm doing something wrong.
awk 'BEGIN { FS = OFS = " " }
NR==FNR {a[$1]++; next}
FNR==1 {print $0, "DUPLICATE"; next}
$1 in a{if (a[$1]>1){print $(NF+1)="DUPE";delete a[$1]}}1' file{,}
I'm still learning awk, any help the community can provide is greatly appreciated, but I'll try to explain what I think the above program does.
- The first line sets the delimiter and the output delimiter to be a tab character.
- This line reads the first file and stores an array with a count of how many times an item appears in the list.
- This outputs the first line which is essentially the header, adding "DUPLICATE" at the end of the last item in the row
- (This is where I'm stuck) If the current value is found in the array "a" it should check if the stored value is greater than one. If yes, it should print the word "DUPE" in the last column. Finally it returns the entire line.
In the test files I keep getting everything marked as "DUPE" or nothing at all.
I've also thought of combining the files and deduping that way, but that would leave me with undesired left-over values from the first file.
What am I doing wrong?