
I have the following file:

ABC     MNH     1
UHR     LOI     2    
QWE     LOI     3
MNH     ABC     4
PUQ     LOI     5
MNH     ABC     6
QWE     LOI     7
LOI     UHR     8    

I want to remove all duplicates (based on the first two columns - e.g., row 6 is a duplicate of row 4). Also I want to merge entries where columns 1 and 2 are permuted (e.g., rows 1 and 4). This means that this list should result in:

ABC     MNH     1 4
UHR     LOI     2 8
QWE     LOI     3
PUQ     LOI     5

However, this file is huge. About 2-3 TB. Can this be done with awk/sed?

riasc
  • What have you tried so far? – oguz ismail May 03 '19 at 10:25
  • That's a really big file. Normally it's amusing when people say they have a huge file, and it turns out to be 20 MB and we're like "no problem!". Whatever you try, or whatever is suggested, I would definitely try it on a 20 MB chunk of the file first, and then multiply the time it takes by 150000 to see if it's practical. Also, the typical way of solving this problem on an unsorted file in one pass is as you go along to load up memory with all of the unique order-independent pairs in your columns one and two. That could be practical or not depending on the percentage of duplicates. – jas May 03 '19 at 10:33
  • How long are the strings in your actual file? Are they always 3 characters? This is just to know the amount of possible combinations. If they are 3 then you have only 26^6 possible unique combinations, hence it is manageable with awk. – kvantour May 03 '19 at 15:09
  • Hang on. Are those numbers at the end of your lines REALLY present in your data or are you just trying to show us the input line numbers across the input/output? – Ed Morton May 03 '19 at 16:34

3 Answers


I don't understand why what you posted is your expected output, so you may have to massage it, but IMHO this is the right way to approach the problem: only "sort" has to store the multi-TB input internally (and sort is designed to do that, with paging etc.), while the awk scripts just process one line at a time and keep very little in memory:

$ cat tst.sh
#!/usr/bin/env bash

# Decorate/sort/undecorate:
#  1) prepend an order-independent key (the two fields, larger one first)
#  2) sort on that key so equivalent pairs end up adjacent
#  3) strip the key again and collect the last field of every line in each group
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -k1,2 |
awk '
    { curr = $1 OFS $2 }
    prev != curr {
        if ( NR>1 ) {
            print rec
        }
        rec = $0
        sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
        prev = curr
        next
    }
    { rec = rec OFS $NF }
    END { print rec }
'

$ ./tst.sh file
ABC     MNH     1 4 6
PUQ     LOI     5
QWE     LOI     3 7
LOI     UHR     8 2
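
To make the decorate/sort/undecorate idea concrete, this is roughly what the first awk step feeds into sort for the first few rows of the sample file - the two extra leading fields are the order-independent key that sort groups on and that the second awk strips off again:

$ awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' file | head -3
MNH ABC ABC     MNH     1
UHR LOI UHR     LOI     2
QWE LOI QWE     LOI     3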

An alternative implementation after discussing with @kvantour in the comments below (requires GNU sort for -s stable sort):

$ cat tst.sh
#!/usr/bin/env bash

awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -s -k1,2 |
awk '
    { curr = $1 OFS $2 }
    prev != curr {
        if ( NR>1 ) {
            print rec
        }
        rec = $0
        sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
        sub(/[[:space:]]+[^[:space:]]+$/,"",rec)
        delete seen
        prev = curr
    }
    !seen[$3,$4]++ { rec = rec OFS $NF }
    END { print rec }
'

$ ./tst.sh file
ABC     MNH 1 4
PUQ     LOI 5
QWE     LOI 3
UHR     LOI 2 8
Ed Morton
  • I'm not sure, but I don't think the first pipe and sort will be able to process the 2TB file. – kvantour May 03 '19 at 22:30
  • Furthermore, in your example output the `6` in the first line should not appear, as the key combination `MNH ABC` was already seen earlier with the value `4`. This also implies that the sort command could change the original order of duplicate keys, which will affect the output. – kvantour May 03 '19 at 22:36
  • Regarding the pipe and sort, some interesting information here: https://stackoverflow.com/questions/43362433/why-using-pipe-for-sort-linux-command-is-slow – kvantour May 03 '19 at 22:46
  • I believe something like `sort -s -T /path/to/extra/harddisk -S4G` might do it. @riasc If the above did not work, please let us know and we'll try to come up with another solution. – kvantour May 03 '19 at 23:25

The always helpful GNU datamash to the rescue!

$ sort -k1,2 -u input.txt |
   awk -v OFS="\t" '$2 < $1 { tmp = $1; $1 = $2; $2 = tmp } { print $1, $2, $3 }' |
   sort -k1,2 |
   datamash groupby 1,2 collapse 3 |
   tr ',' ' '
ABC MNH 1 4
LOI PUQ 5
LOI QWE 3
LOI UHR 2 8

Broken down, this:

  • Sorts the input file based on the first two columns and removes duplicates.
  • If the second column is less than the first column, swaps the two (So MNH ABC 6 becomes ABC MNH 6), and outputs tab-separated columns (Which is what datamash works with by default).
  • Sorts that so all the transformed rows are in order (But this time keeping duplicates).
  • Uses datamash to produce a single line for all rows that share the same first two columns, with a comma-separated list of the third-column values as the third column of the output (Like ABC MNH 1,4; see the worked example after this list)
  • Turns those commas into spaces.
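
For the sample input above, the intermediate data that reaches datamash (swapped, tab-separated and sorted) and the effect of the groupby/collapse step look roughly like this:

$ sort -k1,2 -u input.txt |
   awk -v OFS="\t" '$2 < $1 { tmp = $1; $1 = $2; $2 = tmp } { print $1, $2, $3 }' |
   sort -k1,2
ABC     MNH     1
ABC     MNH     4
LOI     PUQ     5
LOI     QWE     3
LOI     UHR     2
LOI     UHR     8

datamash then folds each group with identical first two columns into one row (ABC MNH 1,4 and so on), and tr replaces the commas with spaces.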

Most memory-efficient solutions will require the data to be sorted, and while the sort program is quite good at doing that, it'll still use a bunch of temporary files so you'll need 2-3 or so terabytes of free disk space.
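
With GNU sort you can also tell it where to put those temporary files and how much RAM to use before spilling to disk: -T picks the temporary directory, -S sets the in-memory buffer size, and --compress-program compresses the temporaries to save scratch space. The path and buffer size below are just placeholders:

$ sort -k1,2 -u -T /path/to/big/scratch/dir -S 16G --compress-program=gzip \
    input.txt > unique_sorted.txt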

If you're going to be doing a lot of stuff with the same data, it's probably worth sorting it once and reusing that file instead of sorting it every time as the first step of a pipeline:

$ sort -k1,2 -u input.txt > unique_sorted.txt
$ awk ... unique_sorted.txt | ...

If there are enough duplicates, and enough RAM that it's feasible to hold the results in memory, it can be done in one pass through the input file, removing duplicates as it goes and then iterating through all the remaining pairs of values:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;

# Pass 1: remember the first value seen for each (col1, col2) pair;
# later duplicates of the same pair are ignored.
my %keys;
while (<>) {
  chomp;
  my ($col1, $col2, $col3) = split ' ';
  $keys{$col1}{$col2} = $col3 unless exists $keys{$col1}{$col2};
}

# Pass 2 over the in-memory pairs: merge each permuted pair into the
# entry whose first key sorts lower, and mark the other entry as consumed.
$, = " ";
while (my ($col1, $sub) = each %keys) {
  while (my ($col2, $col3) = each %$sub) {
    next unless defined $col3;             # already merged into its permutation
    if ($col1 lt $col2 && exists $keys{$col2}{$col1}) {
      $col3 .= " $keys{$col2}{$col1}";     # append the permuted pair's value
      $keys{$col2}{$col1} = undef;         # and mark it as consumed
    } elsif ($col2 lt $col1 && exists $keys{$col2}{$col1}) {
      next;                                # the permuted entry will print this pair
    }
    say $col1, $col2, $col3;
  }
}

This produces output in arbitrary unsorted order for efficiency's sake.
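
Assuming you save the script above as, say, merge_pairs.pl (the name here is arbitrary), you can always pipe its output through sort if you do want it ordered:

$ perl merge_pairs.pl input.txt | sort -k1,2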


And an approach using sqlite (Also requires lots of extra free disk space, and that the columns are separated by tabs, not arbitrary whitespace):

#!/bin/sh

input="$1"

sqlite3 -batch -noheader -list temp.db 2>/dev/null <<EOF 
.separator \t
PRAGMA page_size = 8192; -- Make sure the database can grow big enough (must be a power of two)
CREATE TABLE data(col1, col2, col3, PRIMARY KEY(col1, col2)) WITHOUT ROWID;
.import "$input" data
SELECT col1, col2, group_concat(col3, ' ')
FROM (
 SELECT col1, col2, col3 FROM data WHERE col1 < col2
 UNION ALL
 SELECT col2, col1, col3 FROM data WHERE col2 < col1 
 )
GROUP BY col1, col2
ORDER BY col1, col2;
EOF

rm -f temp.db
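
If your real file separates columns with runs of spaces instead of tabs, a one-off awk pass can produce the tab-separated copy that the .import expects; input.tsv here is just a placeholder name, and you would then point the script above at it:

$ awk -v OFS='\t' '{ print $1, $2, $3 }' input.txt > input.tsv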
Shawn

If your first two columns will only ever have 3 characters, there are only 26^6 possible combinations for the first two columns. This is very easy to handle with awk.

{ key1=$1$2; key2=$2$1 }
(key1 in a) { next }                   # duplicate :> skip
(key2 in a) { print $2,$1,a[key2],$3 } # permutation :> print
{ a[key1]=$3 }                         # store value

This, however, will only print the permutations and, as requested, at most 2 elements. As a consequence, when a permutation is found the array a will contain both key1 and the permuted key key2; otherwise it will only contain key1.

This can be cleaned up with a second array that keeps track of whether a permutation has already been printed. Call it b. This way you can eliminate 2 elements from a while keeping track of one element in b:

{ key1=$1$2; key2=$2$1 }
(key1 in b) || (key2 in b) { next }  # permutation printed, is duplicate
(key1 in a)                { next }  # only duplicate, no permutation found
(key2 in a) {                        # permutation found 
              print $2,$1,a[key2],$3 # - print
              delete a[key1]         # - delete keys from a
              delete a[key2]
              b[key1]                # - store key in b
              next                   # - skip the rest
            }
{ a[key1]=$3 }                       # store value
END { for (k in a) { print substr(k,1,3),substr(k,4,3),a[k] } }
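
If you save the above as, for example, merge.awk (a name picked here just for illustration), running it against the sample file produces the requested result; note that the END loop prints the never-permuted leftovers in whatever order awk happens to iterate the array:

$ awk -f merge.awk file
ABC MNH 1 4
UHR LOI 2 8
QWE LOI 3
PUQ LOI 5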
kvantour
  • @EdMorton I assumed always 3 characters. I wanted to save the byte of subsep – kvantour May 03 '19 at 16:25
  • @EdMorton Each entry in the array `a` represents the original found `key1`. I test if `key1` is in the array to check for a duplicate, but if `key2` is in the array, we encountered a permutation. In the end you should have both `key1` and `key2` in the array which you can use for further duplicates. There is a way to clean up the array. – kvantour May 03 '19 at 22:15