union of two columns of a tsv file

Question

I have a file that stores a directed graph. Each line has the form

node1 TAB node2 TAB weight

I want to find the set of all nodes, i.e. the union of the first two columns. Is there a better way of getting this union? My current solution involves creating temporary files:

cut -f1 input_graph | sort | uniq > nodes1
cut -f2 input_graph | sort | uniq > nodes2
cat nodes1 nodes2 | sort | uniq > nodes

dan3 · Accepted Answer · 2013-09-26T08:04:41.637

3

{ cut -f1 input_graph; cut -f2 input_graph; } | sort | uniq

No need to sort twice.

The { cmd1; cmd2; } syntax is equivalent to (cmd1; cmd2) but may avoid a subshell.
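To make the difference concrete, here is a small sketch (the variable names are just for illustration). An assignment inside ( ... ) is lost when the subshell exits, while one inside { ...; } happens in the current shell and sticks:

```shell
x=0

# Parentheses run the commands in a subshell, so this assignment is lost.
( x=1 )
after_subshell=$x

# Braces run the commands in the current shell, so this one persists.
{ x=2; }
after_group=$x

printf 'after ( ): %s, after { }: %s\n' "$after_subshell" "$after_group"
```

Note that when a group is one side of a pipeline, as in the one-liner above, the shell may still run it in a subshell, which is why the saving is only a "may".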

In another language (e.g. Perl), you could slurp the first column into a hash and then process the second column sequentially.
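The same hash idea can be sketched without Perl, using awk's associative arrays instead. This is a swapped-in illustration rather than the Perl version itself; the sample data and the workdir/graph names are made up for the example:

```shell
# Hypothetical sample graph: node1 TAB node2 TAB weight.
workdir=$(mktemp -d)
graph="$workdir/input_graph"
printf 'a\tb\t1\nb\tc\t2\na\tc\t5\n' > "$graph"

# Use an awk array as a set: merely referencing seen[key] creates the key.
# The iteration order of "for (n in seen)" is unspecified, so the (small)
# list of distinct nodes is sorted afterwards for stable output.
nodes=$(awk -F'\t' '{ seen[$1]; seen[$2] } END { for (n in seen) print n }' "$graph" | sort)

printf '%s\n' "$nodes"
rm -r "$workdir"
```

Here only the distinct nodes get sorted, not every line of the edge list, which may matter when the graph has many more edges than nodes.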

In Bash specifically (this is not plain POSIX sh), you can avoid temporary files with process substitution: cat <(cmd1) <(cmd2). Bash takes care of creating temporary file descriptors and setting up the pipelines.
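Applied to the graph file, the cat <(cmd1) <(cmd2) form might look like the sketch below. It needs bash; the sample data and the workdir/graph names are assumptions for the example, and sort -u is used as a shorthand for sort | uniq:

```shell
#!/usr/bin/env bash
# Hypothetical sample graph: node1 TAB node2 TAB weight.
workdir=$(mktemp -d)
graph="$workdir/input_graph"
printf 'a\tb\t1\nb\tc\t2\na\tc\t5\n' > "$graph"

# Each <(...) is replaced by a file name that cat can read from,
# so both columns are concatenated without any temporary file.
nodes=$(cat <(cut -f1 "$graph") <(cut -f2 "$graph") | sort -u)

printf '%s\n' "$nodes"
rm -r "$workdir"
```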

In a script (where you may want to avoid requiring bash), if you end up needing temporary files, use mktemp to create them rather than hard-coding file names.
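For illustration only, the original three-step approach could be written with mktemp roughly as follows. The sample data and variable names are made up, and the trap line is one common way to clean up, not the only one:

```shell
# Hypothetical sample graph: node1 TAB node2 TAB weight.
graph=$(mktemp)
printf 'a\tb\t1\nb\tc\t2\na\tc\t5\n' > "$graph"

# mktemp creates each file and prints its unique name.
nodes1=$(mktemp)
nodes2=$(mktemp)

# Remove the temporary files when the script exits.
trap 'rm -f "$graph" "$nodes1" "$nodes2"' EXIT

cut -f1 "$graph" | sort -u > "$nodes1"
cut -f2 "$graph" | sort -u > "$nodes2"

# sort -u accepts several input files and outputs their deduplicated union.
nodes=$(sort -u "$nodes1" "$nodes2")
printf '%s\n' "$nodes"
```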

edited Sep 26 '13 at 08:04

answered Sep 26 '13 at 07:50


can you elaborate on `mktemp`? – damned Oct 23 '13 at 10:25
That was a marginal point, an FYI for your future scripting habits (which is why I mentioned it last). E.g. in your ORIGINAL code, you could use `mktemp` to generate temporary file names (instead of hard-coding filenames "nodes1" and "nodes2"): `NODES1=$(mktemp); cut -f1 input_graph | sort | uniq > "$NODES1"`. But of course there is no actual need for temporary files of any sort, with hard-coded names or not :) – dan3 Oct 23 '13 at 10:30
