86

I have two files, A-nodes_to_delete and B-nodes_to_keep. Each file has many lines with numeric ids.

I want the list of numeric ids that are in nodes_to_delete but NOT in nodes_to_keep, i.e. A\B

Doing it within a PostgreSQL database is unreasonably slow. Any neat way to do it in bash using Linux CLI tools?

UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq, sort and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.

Arne
  • 17,706
  • 5
  • 83
  • 99
Adam Matan
  • 128,757
  • 147
  • 397
  • 562
  • I'm curious as to what answers will come. Bash is a bit more segphault, system admin I believe. If you would have said "in python" or "in php" or whatever your chances would have been better :) – extraneon Mar 24 '10 at 16:45
  • I saw the title and was all ready to bash UI inconsistencies and holier-than-thou help forums. This left me disappointed when I read the actual question. :( – aehiilrs Mar 24 '10 at 16:47

7 Answers

120

The comm command does that.
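
For example, a minimal sketch for the OP's case (this assumes bash process substitution and the file names from the question; comm needs sorted input, so sort on the fly):

# print only column 1: lines that appear in A-nodes_to_delete but not in B-nodes_to_keep
comm -23 <(sort A-nodes_to_delete) <(sort B-nodes_to_keep)

-2 suppresses the lines unique to the second file and -3 suppresses the lines common to both, leaving exactly A\B.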

msw
  • 42,753
  • 9
  • 87
  • 112
  • 11
    And if the files are not sorted yet, `sort` first. – extraneon Mar 24 '10 at 16:47
  • 2
    +1 Enlightened, great tool that I feel stupid not to have known. Thanks! – Adam Matan Mar 24 '10 at 17:10
  • 9
    @Just Won't start a flame war here, but your comment is just rude. – Adam Matan Mar 24 '10 at 18:43
  • 4
    @Adam: Ironically, that "comm" bit of arcana dates back to a time when you could keep the whole contents of /bin and /usr/bin in your head, before all these fancy perls and pythons and mysqls. Back in those simpler V7 days you had to make use of all the tools or (gasp!) write your own, with ed(1), in the snow, uphill both ways, and we liked it! ;) I'd probably never know of comm if I'd started later. – msw Mar 24 '10 at 23:13
  • @msw True, and it's still amazing how useful these CLI tools can be, even with the shiny new DBMSs. – Adam Matan Mar 25 '10 at 08:33
  • 4
    @Adam Matan: I'm sorry, rudeness definitely wasn't my intention. In fact, the command I posted is a good way to learn a great deal about the system, and I used to do stuff like that to enlighten myself. Otherwise e. g. `join(1)` would have remained unknown to me. – just somebody Mar 26 '10 at 18:00
  • `comm -2 A B | sed '/^\t.*$/d'` – wieczorek1990 Oct 06 '14 at 10:49
  • would be great if this answer didn't link off-site – activedecay Jan 18 '17 at 01:00
  • 2
    Since this is the accepted answer, can you include the command line that does what the OP asked for, like @activedecay does below? – Bill Apr 07 '19 at 12:29
66

Set operations between files $1 and $2

set_union () {
   sort $1 $2 | uniq
}

set_intersection () {
   sort $1 $2 | uniq -d
}

set_difference () {
   sort $1 $2 $2 | uniq -u
}

set_symmetric_difference() {
   sort $1 $2 | uniq -u
}
  • uniq prints each distinct line once (adjacent duplicates are collapsed)
  • uniq -u (--unique) prints only non-duplicated lines
  • uniq -d (--repeated) prints only duplicated lines

NOTE: These assume that the files $1 and $2 do not individually contain duplicate lines. If they do, consider preprocessing them with sort $1 | uniq (or sort -u $1) to eliminate the duplicate lines first.
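
A usage sketch for the OP's files (this assumes the functions above are defined in the current shell; the output file name is just illustrative):

# ids present in A-nodes_to_delete but not in B-nodes_to_keep
set_difference A-nodes_to_delete B-nodes_to_keep > ids_safe_to_delete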

Mateen Ulhaq
  • 24,552
  • 19
  • 101
  • 135
slinkp
  • 3,406
  • 2
  • 21
  • 18
  • 2
    i think this is better than the accepted answer... ``comm`` is not available in all environments. – danwyand Jun 07 '14 at 18:26
  • 3
    That's symmetric difference, not normal set difference. – Tgr Mar 17 '15 at 00:56
  • @Tgr pretty sure it's normal set difference. – orip Aug 13 '15 at 14:20
  • @wieczorek1990 I'm not sure what examples with stdin work for the sort+uniq solutions that won't for comm, but in any case - for both comm and sort+uniq - this approach usually wins (showing Peteris Krumins's comm example for set difference) 'comm -23 <(sort file1) <(sort file2)' See http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/ – orip Aug 13 '15 at 14:22
  • @PeterK I may be misunderstanding so lets try an example: `printf "0\n1\n" > a.txt; printf "1\n2\n" > b.txt; cat a.txt b.txt b.txt | sort | uniq -u`. If this were symmetric difference this would print "0" and "2", but it only prints "0" because - AFAICT - it's normal set difference. – orip Aug 09 '16 at 21:55
  • 2
    @Tgr my answer would give symmetric difference if each file was read once, but the second file is read twice, so the output is only the lines unique to the first file. – slinkp Aug 10 '16 at 22:29
  • @orip thanks for noticing that `cat` can be removed. I'll update my answer. – slinkp Aug 10 '16 at 22:30
  • 6
    `set_difference` and `set_symmetric_difference` won't always work correctly - they will drop lines unique to the first input file if those lines are not unique within that file. – Leon Oct 10 '16 at 07:03
  • 1
    @Leon how can something be unique but not unique, dont get it – neboduus Jan 22 '21 at 12:27
  • @neboduus When talking about two files "unique to the first file" means "occurring in the first file but not in the second file". "Unique within a file" means just that. – Leon Jan 22 '21 at 17:37
  • @Leon For these algorithms to work, the original files must not contain duplicate lines. Regardless of an entry being _unique to_ a file, each entry must be _unique within_ a file -- the files themselves must represent a set. – neuralmer Aug 11 '22 at 14:44
17

Use comm - it will compare two sorted files line by line.

The short answer to your question

This command returns the lines that are in deleteNodes but not in keepNodes.

comm -1 -3 <(sort keepNodes) <(sort deleteNodes)

Example setup

Let's create the files named keepNodes and deleteNodes, and use them as unsorted input for the comm command.

$ cat > keepNodes <(echo bob; echo amber;)
$ cat > deleteNodes <(echo bob; echo ann;)

By default, running comm without options prints 3 columns with this layout:

$ comm FILE1 FILE2
lines_unique_to_FILE1
    lines_unique_to_FILE2
        lines_which_appear_in_both

Using our example files above, run comm without options and note the three columns.

$ comm <(sort keepNodes) <(sort deleteNodes)
amber
    ann
        bob

Suppressing column output

Suppress column 1, 2, or 3 by passing -1, -2, or -3; note that when a column is hidden, the indentation of the remaining columns shrinks accordingly.

$ comm -1 <(sort keepNodes) <(sort deleteNodes)
ann
    bob
$ comm -2 <(sort keepNodes) <(sort deleteNodes)
amber
    bob
$ comm -3 <(sort keepNodes) <(sort deleteNodes)
amber
    ann
$ comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
ann
$ comm -2 -3 <(sort keepNodes) <(sort deleteNodes)
amber
$ comm -1 -2 <(sort keepNodes) <(sort deleteNodes)
bob

Sorting is important!

If you run comm without sorting the files first, it complains, telling you which file is not in sorted order:

comm: file 1 is not in sorted order

activedecay
  • 10,129
  • 5
  • 47
  • 71
  • 1
    +1 for correct examples that include the answer to the OP's specific question (output lines in `deleteNodes` that are not in `keepNodes`), but would be better if the correct solution was highlighted: `comm -1 -3 <(sort keepNodes) <(sort deleteNodes)`. – John B Feb 12 '17 at 19:19
5

comm was specifically designed for this kind of use case, but it requires sorted input.

awk is arguably a better tool for this, as finding a set difference with it is fairly straightforward, it doesn't require sorted input, and it offers additional flexibility.

awk 'NR == FNR { a[$0]; next } !($0 in a)' nodes_to_keep nodes_to_delete

NR == FNR is true only while the first file (nodes_to_keep) is being read, so its lines are stored as keys of the array a; a line from the second file (nodes_to_delete) is printed only if it is not a key of a.

Perhaps, for example, you'd like to find the difference only among lines that represent non-negative numbers:

awk -v r='^[0-9]+$' 'NR == FNR && $0 ~ r {
    a[$0]
    next
} $0 ~ r && !($0 in a)' nodes_to_keep nodes_to_delete
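
A quick sanity check of the first command, with illustrative file contents:

printf '1\n2\n3\n' > nodes_to_keep
printf '2\n3\n4\n5\n' > nodes_to_delete
awk 'NR == FNR { a[$0]; next } !($0 in a)' nodes_to_keep nodes_to_delete
# prints:
# 4
# 5
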
John B
  • 3,566
  • 1
  • 16
  • 20
1

Maybe you need a better way to do it in Postgres; I can pretty much bet that you won't find a faster way to do it using flat files. You should be able to do it with a simple anti-join (e.g. NOT EXISTS) or EXCEPT, and assuming that both id columns are indexed, that should be very fast.

Dark Castle
  • 1,289
  • 2
  • 9
  • 20
  • You're technically correct, and the `explain` supports your claim, but it simply doesn't work for very large (~tens of millions) tables. – Adam Matan Mar 24 '10 at 17:10
  • 1
    Yeah it would be constrained by your memory unlike something like a sorted comm but I would think that if you have two tables with only an int id field that you could get into the 10s of millions with no trouble. – Dark Castle Mar 24 '10 at 17:14
  • That's right in theory, but it simply doesn't work for some reason. – Adam Matan Mar 24 '10 at 17:23
1

Another portable solution, which also works for multisets (sets that allow multiple instances of an element), is to use grep with the patterns in a separate file:

grep -Fvx -f B A

The parameters:

  • -f: a file containing the list of patterns, one per line
  • -F: treat the patterns as fixed strings, not regexes
  • -x: match only whole lines of A-nodes_to_delete
  • -v: invert the match (select the lines that do not match)

For each line of A that is not matched by any pattern in B, the command outputs that line; matched lines are suppressed.

A nice feature of this solution is that it can be made to work with multi-column files (for A), whereas the comm and uniq -u solutions require one-column files.
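
A small worked example, with illustrative contents for A and B:

printf '1\n2\n3\n4\n' > A
printf '2\n4\n' > B
# prints 1 and 3: the lines of A not present in B
grep -Fvx -f B A

Note that grep holds all of the patterns from B in memory, so for a very large keep-list the sort/comm approaches may scale better.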

Alenvers
  • 11
  • 2
0

So, this is slightly different from the other answers. I can't say that a C++ compiler is exactly a "Linux CLI tool", but running g++ -O3 -march=native -o set_diff main.cpp (with the code below in main.cpp) can do the trick:

#include<algorithm>
#include<iostream>
#include<iterator>
#include<fstream>
#include<string>
#include<unordered_set>

using namespace std;

int main(int argc, char** argv) {
    ifstream keep_file(argv[1]), del_file(argv[2]);
    // argv[1] is the keep list (B-nodes_to_keep); load every id from it into a hash set.
    unordered_multiset<string> init_lines{istream_iterator<string>(keep_file), istream_iterator<string>()};
    string line;
    // argv[2] is the delete list (A-nodes_to_delete); stream it a line at a time and
    // print each id that is not in the keep set, i.e. nodes_to_delete \ nodes_to_keep.
    while (getline(del_file, line)) {
        if (init_lines.count(line) == 0) {
            cout << line << "\n";
        }
    }
}

To use, simply run set_diff B A (not A B, since B is nodes_to_keep) and the resulting difference (A\B) will be printed to stdout.

Note that I've forgone a few C++ best practices to keep the code simpler.

Many additional speed optimizations could be made (at the price of more memory). mmap would also be particularly useful for large data sets, but that'd make the code much more involved.

Since you mentioned that the data sets are large, I thought that reading nodes_to_delete a line at a time might be a good idea to reduce memory consumption. The approach taken in the code above isn't particularly efficient if there are lots of dupes in your nodes_to_delete, since each duplicate line is looked up (and printed) again. The output follows the order of nodes_to_delete.


Something easier to copy and paste into bash (i.e. skipping creation of main.cpp):

g++ -O3 -march=native -xc++ -o set_diff - <<EOF
#include<algorithm>
#include<iostream>
#include<iterator>
#include<fstream>
#include<string>
#include<unordered_set>

using namespace std;

int main(int argc, char** argv) {
        ifstream keep_file(argv[1]), del_file(argv[2]);
        // argv[1] is the keep list (B-nodes_to_keep); load every id from it into a hash set.
        unordered_multiset<string> init_lines{istream_iterator<string>(keep_file), istream_iterator<string>()};
        string line;
        // argv[2] is the delete list (A-nodes_to_delete); stream it a line at a time and
        // print each id that is not in the keep set, i.e. nodes_to_delete \ nodes_to_keep.
        while (getline(del_file, line)) {
                if (init_lines.count(line) == 0) {
                        cout << line << "\n";
                }
        }
}
EOF
YenForYang
  • 2,998
  • 25
  • 22