22

How to subtract a set from another in Bash?

This is similar to "Is there a "set" data structure in bash?", but different in that it asks how to perform the subtraction, with code.

  • set1: N lines as output by a filter
  • set2: M lines as output by a filter

how to get:

  • set3: with all lines in N which don't appear in M
Community
Robottinosino

5 Answers

22
comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort)

Without options, comm displays three columns of output: (1) lines only in the first file, (2) lines only in the second file, and (3) lines in both files. -23 suppresses the second and third columns.

$ cat > file1.list
A
B
C
$ cat > file2.list
A
C
D
$ comm file1.list file2.list 
        A
B
        C
    D
$ comm -12 file1.list file2.list # In both
A
C
$ comm -23 file1.list file2.list # Only in set 1
B
$ comm -13 file1.list file2.list # Only in set 2
D

Input files must be sorted.

GNU sort and comm depend on the locale: the output order may differ from one locale to another (the content stays the same). Setting LC_ALL=C makes both commands use the same byte-wise ordering:

(export LC_ALL=C; comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort))
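
As a minimal sketch of the same idea, the function below sorts and de-duplicates both inputs under LC_ALL=C before handing them to comm -23. The name set_subtract and the use of temporary files are assumptions made for illustration, not part of the answer; temporary files are used instead of process substitution so the sketch also runs in plain POSIX sh.

```shell
# set_subtract FILE1 FILE2
# Print the lines of FILE1 that do not appear in FILE2.
# Hypothetical helper; the inputs do not need to be sorted beforehand.
# The body runs in a subshell so the variables stay local.
set_subtract() (
    a=$(mktemp) || exit 1
    b=$(mktemp) || { rm -f "$a"; exit 1; }
    # Sort both inputs in the same locale that comm will use;
    # -u drops duplicate lines so each input behaves like a set.
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    # -2 hides lines found only in FILE2, -3 hides lines found in both.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
)
```

With the sample files from the example above, `set_subtract file1.list file2.list` prints B.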
Community
Nahuel Fouilleul
5

uniq -u (manpage) is often the simplest tool for list subtraction:

Usage

uniq [OPTION]... [INPUT [OUTPUT]] 
[...]
-u, --unique
    only print unique lines

Example: list files found in directory a but not in b

$ ls a
file1  file2  file3
$ ls b
file1  file3

$ echo "$(ls a ; ls b)" | sort | uniq -u
file2
YSC
  • 3
    This is the symmetric difference, not the relative complement. Any unique elements in B will also be in the result. However, if there are no elements in B that are not in A, then this works well. – Brent Aug 29 '17 at 14:51
  • 2
    To echo @Brent, this is technically not set subtraction. This is the symmetric difference between two sets. It finds all files in only ONE of the two directories `a` and `b`. – makansij Sep 08 '17 at 03:41
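
One way to address the point raised in the comments, sketched here as an assumption rather than as part of the answer: feed the second list in twice. Every line of the second list then occurs at least twice and is dropped by uniq -u, so only lines exclusive to the first list survive. This assumes the first list contains no duplicate lines of its own; the function name only_in_first is hypothetical.

```shell
# only_in_first FILE_A FILE_B
# Print the lines of FILE_A that do not appear in FILE_B.
# Assumes FILE_A itself has no repeated lines.
only_in_first() {
    # FILE_B is read twice, so each of its lines has a count of at
    # least 2 after sorting and is therefore filtered out by uniq -u.
    cat "$1" "$2" "$2" | sort | uniq -u
}
```

For directories, the same trick can be written as `{ ls a; ls b; ls b; } | sort | uniq -u`.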
2

I've got a dead-simple 1-liner:

$ now=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log)

$ later=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log baz foo)

$ printf "%s\n" "${now[@]}" "${later[@]}" | sort | uniq -c | grep -vE '^ *2 ' | awk '{print $2}'
baz
foo

By definition, two sets intersect if they have elements in common. Since there are only two sets here, any element with a count of 2 is in the intersection; simply "subtract" those with grep.
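
The counting approach returns elements that occur in only one of the two sets, whichever one that is. Where a one-directional difference is wanted, a different technique is an awk lookup table, sketched below under the assumption that each set is stored one element per line in a file. The function name subtract_lists is hypothetical. No sorting is needed and the order of the first file is preserved.

```shell
# subtract_lists KEEP_FILE REMOVE_FILE
# Print the lines of KEEP_FILE that do not appear in REMOVE_FILE.
subtract_lists() {
    # awk reads REMOVE_FILE first and records each line in "seen";
    # while reading KEEP_FILE it prints only lines that were not recorded.
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$2" "$1"
}
```

With the two arrays above written out via `printf '%s\n' "${later[@]}" > later.txt` and likewise for now, `subtract_lists later.txt now.txt` prints baz and foo, while swapping the arguments prints nothing.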

axsyse
Christian Bongiorno
1

I recently wrote a program called Setdown that performs set operations (such as set difference) from the command line.

It can perform set operations by writing a definition similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

It's pretty cool and you should check it out. I personally don't recommend the "set operations in unix shell" post. It won't work well when you really need to do many set operations or if you have any set operations that depend on each other.

At any rate, I think that it's pretty cool and you should totally check it out.

Robert Massaioli
0

You can use diff

# you should sort the output
ls > t1
cp t1 t2

I used vi to remove some entries from t2

$ cat t1
AEDWIP.writeMappings.sam
createTmpFile.sh*
find.out
grepMappingRate.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*

$ cat t2
AEDWIP.writeMappings.sam
createTmpFile.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*

diff reports lines in t1 that are not in t2

$ diff t1 t2
3,4d2
< find.out
< grepMappingRate.sh*

Putting it all together:

diff t1 t2 | grep "^<" | cut -d " " -f 2
find.out
grepMappingRate.sh*
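
Note that cut -d " " -f 2 keeps only the second space-separated field, so a line that itself contains spaces would be truncated. A minimal sketch that keeps the whole line, assuming both files are already sorted and free of duplicates (the function name lines_only_in_first is hypothetical):

```shell
# lines_only_in_first FILE1 FILE2
# Print the lines that diff reports as present only in FILE1.
# Both files are assumed to be sorted and free of duplicates.
lines_only_in_first() {
    # diff marks FILE1-only lines with "< "; sed strips that
    # two-character prefix and prints only the lines that had it.
    diff "$1" "$2" | sed -n 's/^< //p'
}
```

With the files above, `lines_only_in_first t1 t2` prints find.out and grepMappingRate.sh*.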
AEDWIP