5

I couldn't find an answer that truly subtracts one file from another.

My goal is to remove lines in one file that occur in another file. Multiple occurrences should be respected: for example, if a line occurs 4 times in file A and only once in file B, file C should contain 3 of those lines.

File A:

1
3
3
3
4
4

File B:

1
3
4

File C (desired output):

3
3
4

Thanks in advance

jas
Hawk
  • Use Perl. Load File B into a hash where the value = number of appearances of each value. For each line of file A, if it is found with a nonzero value, decrement the value. If it is not found, emit the line. – Ben Mar 06 '17 at 10:57
  • What have you tried so far? Any code produced by you? Any error message? – Jose Ricardo Bustos M. Mar 06 '17 at 11:02
  • @JoseRicardoBustosM. I tried adding all entries from file B to an array and deleting the elements of the array in file A, which unfortunately doesn't work that way in AWK. James Brown's answer seems to work well though – Hawk Mar 08 '17 at 08:48

3 Answers

3

If the input files are already sorted, as in the sample, comm is a better fit:

$ comm -23 f1 f2
3
3
4

Option descriptions from the man page:

   -2     suppress column 2 (lines unique to FILE2)
   -3     suppress column 3 (lines that appear in both files)
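
If the files are not already sorted, comm cannot be relied on, so one option is to sort copies first. A minimal sketch, assuming a POSIX shell; the function name `subtract_sorted` and the file names are made up for illustration. Note that the result comes out in sorted order rather than in the original order of File A, and that plain `sort` (not `sort -n`) is used so the ordering matches what comm expects.

```shell
# Hypothetical helper: print the lines of $1 minus the lines of $2,
# respecting multiplicity. Output is in sorted order.
subtract_sorted() {
    tmp=$(mktemp -d) || return 1
    sort "$1" > "$tmp/a.sorted"
    sort "$2" > "$tmp/b.sorted"
    comm -23 "$tmp/a.sorted" "$tmp/b.sorted"
    rm -rf "$tmp"
}

# The question's data in shuffled order (made-up file names):
printf '%s\n' 4 3 1 3 4 3 > a_unsorted.txt
printf '%s\n' 4 1 3 > b_unsorted.txt
subtract_sorted a_unsorted.txt b_unsorted.txt
# 3
# 3
# 4
```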
Sundeep
3

In awk:

$ awk 'NR==FNR{a[$0]--;next} ($0 in a) && ++a[$0] > 0' f2 f1
3
3
4

Explained:

NR==FNR {                  # while reading the first file (f2)
    a[$0]--;               # count each line downward: a[line] = -(occurrences in f2)
    next
}
($0 in a) && ++a[$0] > 0   # for f1: if the line was seen in f2, increment its count
                           # and print only once the count has risen above zero

If you also want to print lines of f1 that do not occur in f2 at all, drop the ($0 in a) && test:

$ echo 5 >> f1
$ awk 'NR==FNR{a[$0]--;next} (++a[$0] > 0)' f2 f1
3
3
4
5
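
One caveat with the `NR==FNR` idiom: if f2 is empty, `NR==FNR` stays true while f1 is being read, so every line of f1 is counted as a removal and nothing is printed. A possible workaround, sketched here on the assumption that the two file names differ, is to test `FILENAME==ARGV[1]` instead. The function name `subtract_lines` is hypothetical.

```shell
# Hypothetical wrapper: print the lines of $1 minus the lines of $2
# (respecting multiplicity), keeping the order of $1.
# Assumes $1 and $2 are different file names.
subtract_lines() {
    awk '
        FILENAME == ARGV[1] { a[$0]--; next }   # file to subtract: count each line downward
        ++a[$0] > 0                             # other file: print once the count is positive
    ' "$2" "$1"
}
```

With the files from the question, `subtract_lines f1 f2` should print the same lines as the one-liner above, and with an empty f2 it prints f1 unchanged.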
James Brown
  • If I understand the question correctly, if `3` occurs twice in `f2` and three times in `f1`, it should only occur once in the output. In this solution it will still appear twice. – jas Mar 06 '17 at 15:10
  • @jas At OP File C - expected output-, 3 appears twice. – George Vasiliou Mar 06 '17 at 15:18
  • Correct, @GeorgeVasiliou, but I mean a different test case, where File B contains `1 3 3 4`. (In that case, e.g., Sundeep's solution outputs just `3 4`.) – jas Mar 06 '17 at 15:26
  • @JamesBrown Could you please explain how the part `a[$0]-=1 > 0` works...? The part `($0 in a)` is clear, but i fail to understand the next condition. – George Vasiliou Mar 06 '17 at 15:31
  • @GeorgeVasiliou, @jas Had my `++` and `--` mixed up. This should work better... Thanks for pointing it out. – James Brown Mar 06 '17 at 20:54
  • @JamesBrown Nice update. Just a newbie question: is the >0 really required? I think not. The >0 is not arithmetic here. awk will print even with ++a[$0], right? – George Vasiliou Mar 06 '17 at 20:58
  • @GeorgeVasiliou I'd say it is as a negative value would render bare `++a[$0]` true. Right? Can't test it atm. – James Brown Mar 06 '17 at 22:00
  • @JamesBrown I'm 99% sure that you can simplify down to : `awk 'NR==FNR{a[$0]--;next}++a[$0]' f2 f1`. Give it a try when you will be able to do so. `++a[$0]` will return true (and thus will print) for all non zero values, no matter if it is positive or negative values. – George Vasiliou Mar 06 '17 at 22:09
  • Check. If you're sure, add your own answer. I'll ^. – James Brown Mar 06 '17 at 22:16
  • @JamesBrown Thanks! The second answer is what I was looking for. – Hawk Mar 08 '17 at 08:46
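
Whether the `> 0` is needed can be checked with the test case jas suggests in the comments (File B containing `1 3 3 4`). A quick sketch with made-up file names; it relies on the fact that a bare numeric pattern in awk is true for any nonzero value, negative ones included:

```shell
# File A from the question, File B from jas's test case (hypothetical file names).
printf '%s\n' 1 3 3 3 4 4 > fileA.txt
printf '%s\n' 1 3 3 4 > fileB.txt

# With the comparison: only counts that have climbed above zero are printed.
awk 'NR==FNR{a[$0]--;next} ++a[$0] > 0' fileB.txt fileA.txt
# 3
# 4

# Without it: the first 3 takes a["3"] from -2 to -1, which is nonzero
# and therefore true, so one line too many is printed.
awk 'NR==FNR{a[$0]--;next} ++a[$0]' fileB.txt fileA.txt
# 3
# 3
# 4
```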
2

You can do:

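awk 'NR==FNR { ++cnt[$0]; next }   # f2: count occurrences of each whole line
     cnt[$0]-- > 0 { next }        # f1: skip a line while copies from f2 remain
     1' f2 f1                      # print everything else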
dawg