
I'm back again with another awk question.

I have multiple large files that contain data I want to dedupe against each other. Let's say I have the following data for one month:

fruit   number  rand
apple   12  342
taco    19  264
tortilla    2234    53423
tortillas   2   3431
apricot 13221   23424
apricots    24234   3252
pineapple   2342    2342
radish  1   3

The following month I receive this data:

fruit   number  rand
pineapple   2   698
apple   34  472
taco    19  234
tortilla    16  58
tortillas   87  25
potato  234 2342
radish  1   55
grapes  9   572 422
apricot 13221   24

What I am trying to do is take the second file and check the values of its first column against the first file. If an item already exists in the first file, I want to remove it from the second file, leaving only items that are unique to the second file relative to the first.

The desired outcome would leave me something like this:

fruit   number  rand    DUPLICATE
pineapple   2   698     DUPE
apple   34  472         DUPE
taco    19  234         DUPE
tortilla    16  58      DUPE
tortillas   87  25      DUPE
potato  234 2342
radish  1   55          DUPE
grapes  9   572 422
apricot 13221   24      DUPE

Or, more clearly:

fruit   number  rand
potato  234 2342
grapes  9   572 422

I was trying to think of a way to do this without having to sort the files, so I tried to modify @karafka's answer to a related question: instead of passing the same file twice, I passed the two different files. Obviously I'm doing something wrong.

awk 'BEGIN { FS = OFS = "\t" }
    NR==FNR {a[$1]++; next}
    FNR==1  {print $0, "DUPLICATE"; next}
    $1 in a{if (a[$1]>1){print $(NF+1)="DUPE";delete a[$1]}}1' file{,}

I'm still learning awk, so any help the community can provide is greatly appreciated. Here is what I think the above program does:

  1. The first line sets the input and output field separators to a tab character.
  2. This line reads the first file and stores, in array `a`, a count of how many times each item appears in the list.
  3. This prints the first line, which is essentially the header, appending "DUPLICATE" to the end of the row.
  4. (This is where I'm stuck.) If the current value is found in array `a`, it should check whether the stored count is greater than one. If yes, it should print the word "DUPE" in the last column. Finally, the trailing `1` prints the entire line.
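To see why the `NR==FNR` rule fires only on the first file, it helps to watch `NR` and `FNR` side by side. A quick sketch (throwaway files under `/tmp` are made up for the demo):

```shell
# NR counts lines across ALL input files; FNR resets to 1 for each file.
# So NR==FNR holds only while awk is still reading the first file.
printf 'a\nb\n' > /tmp/f1.txt
printf 'x\ny\n' > /tmp/f2.txt
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, (NR==FNR ? "first" : "second") }' \
    /tmp/f1.txt /tmp/f2.txt
```

which prints:

```
/tmp/f1.txt NR=1 FNR=1 first
/tmp/f1.txt NR=2 FNR=2 first
/tmp/f2.txt NR=3 FNR=1 second
/tmp/f2.txt NR=4 FNR=2 second
```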

With my test files I keep getting either everything marked as "DUPE" or nothing at all.

I've also thought of combining the files and deduping that way, but that would leave me with undesired left-over values from the first file.

What am I doing wrong?

  • Doing my due diligence and researching more, I found this snippet. `awk -F, ' !seen[$2]++ { line[$2] = $0 } END { for(val in seen) if(seen[val]==1) print line[val] }' file` from a [similar question](http://stackoverflow.com/questions/22308082/remove-all-lines-from-file-with-duplicate-value-in-field-including-the-first-oc?rq=1). I will update if this gets me closer to my goal. – mikelcal Apr 13 '17 at 21:15
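As an aside, the snippet quoted in the comment above solves a slightly different problem: it drops every line whose second field occurs more than once, including the first occurrence. A quick demonstration on made-up throwaway data:

```shell
printf 'a,1\nb,2\nc,1\n' > /tmp/demo.csv
# !seen[$2]++ is true only the first time a value of field 2 appears,
# so line[$2] remembers the first line carrying each value.
# END prints only values seen exactly once.
awk -F, '!seen[$2]++ { line[$2] = $0 }
         END { for (val in seen) if (seen[val] == 1) print line[val] }' /tmp/demo.csv
# prints: b,2   (the value 1 appears twice, so a,1 and c,1 are both dropped)
```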

1 Answer


I think what you're doing wrong is just trying to use a bunch of scripts that are unrelated to your current problem as your starting point.

It sounds like all you need is:

$ awk '
NR==FNR { file1[$1]; next }
FNR==1 || !($1 in file1)
' file1 file2
fruit   number  rand
potato  234 2342
grapes  9   572 422
  • Thanks Ed Morton! This does exactly what I want it to do. You are correct, I was entering the data by hand and `radish` should not be part of the output. Can you please help me understand how this works? – mikelcal Apr 14 '17 at 02:32
  • It's vastly simpler than the scripts you've been using, so how do YOU think it works? I'll be happy to answer questions... – Ed Morton Apr 14 '17 at 02:45
  • I'll give it a shot – mikelcal Apr 14 '17 at 03:23
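For later readers, the accepted one-liner can be unpacked roughly like this (the `/tmp` sample files below are tab-separated stand-ins recreated from the question's data):

```shell
printf 'fruit\tnumber\trand\napple\t12\t342\nradish\t1\t3\n' > /tmp/file1
printf 'fruit\tnumber\trand\napple\t34\t472\npotato\t234\t2342\nradish\t1\t55\n' > /tmp/file2

awk '
  NR==FNR { file1[$1]; next }   # pass 1 (file1 only): referencing file1[$1] creates the
                                # key; next skips the remaining rules for these lines
  FNR==1 || !($1 in file1)      # pass 2 (file2): a bare pattern with no action prints
                                # the line -- true for the header or any unseen key
' /tmp/file1 /tmp/file2
```

With the sample files above this prints only the header line and the `potato` row, since `apple` and `radish` both appear in file1. Note that `file1[$1]` stores keys without counting anything, which is all a membership test with `in` needs.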