I have a 1.txt file (with ||o|| as the field separator):

aidagolf6@gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bf6265003ae067b19b88fa4359d5c392||o||Aida||o||Aida||o||Garic Gara
aidagolf6@gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolfa@hotmail.com||o||14f87ec1e760d16c0380c74ec7678b04||o||Aida||o||Aida||o||Rodriguez Puerto

and a 2.txt file (with : as the field separator):

bf6265003ae067b19b88fa4359d5c392:hyworebu:@
14f87ec1e760d16c0380c74ec7678b04:sujycugu

I want a result.txt file (produced by matching the 2nd column of 1.txt against the 1st column of 2.txt; where they match, the 2nd column of 1.txt is replaced with the 2nd column of 2.txt):

aidagolf6@gmail.com||o||hyworebu:@||o||Aida||o||Aida||o||Garic Gara
aidagolfa@hotmail.com||o||sujycugu||o||Aida||o||Aida||o||Rodriguez Puerto

And a left.txt file (the rows of 1.txt that have no match in 2.txt):

aidagolf6@gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?

The script I am trying is:

awk -F '[|][|]o[|][|]' -v s1="||o||"  '
NR==FNR {
a[$2] = $1; 
b[$2]= $3s1$4s1$5; 
next
} 
($1 in a){
$1 = "";
sub(/:/, "")
print a[$1]s1$2s1b[$1] > "result.txt";
next
}' 1.txt 2.txt

The problem is that the script applies the ||o|| separator to 2.txt as well, which gives wrong results.

EDIT

Modified script:

awk -v s1="||o||"  '
NR==FNR {
a[$2] = $1; 
b[$2]= $3s1$4s1$5; 
next
} 
($1 in a){
$1 = "";
sub(/:/, "")
print a[$1]s1$2s1b[$1] > "result.txt";
next
}' FS = "||o||" 1.txt FS = ":" 2.txt

Now I am getting the following error:

awk: fatal: cannot open file `FS' for reading (No such file or directory)
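The error comes from the spaces around `=`: awk treats the bare word `FS` as a filename. A per-file separator must be written as a single var=value argument, with no spaces, placed before the file it applies to. A minimal sketch of a corrected invocation follows; the bracketed form `[|][|]o[|][|]` is needed because in a multi-character FS a bare `|` is treated as regex alternation, and the simplified program body is my assumption, not the original logic:

```shell
# Sketch: per-file FS as var=value (no spaces), before each file name.
# Sample rows taken from the question.
cat > 1.txt <<'EOF'
aidagolf6@gmail.com||o||bf6265003ae067b19b88fa4359d5c392||o||Aida||o||Aida||o||Garic Gara
EOF
cat > 2.txt <<'EOF'
bf6265003ae067b19b88fa4359d5c392:hyworebu:@
EOF
out=$(awk -v s1="||o||" '
NR == FNR { a[$2] = $1; b[$2] = $3 s1 $4 s1 $5; next }
$1 in a {
    r = substr($0, index($0, ":") + 1)   # keep any ":" inside the password
    print a[$1] s1 r s1 b[$1]
}' FS='[|][|]o[|][|]' 1.txt FS=':' 2.txt)
echo "$out"
```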

Bhawan

1 Answer


I've modified your original script:

awk -F'[|][|]o[|][|]' -v s1="||o||" '

NR == FNR {
    a[$2] = $1; 
    b[$2] = $3 s1 $4 s1 $5;
    c[$2] = $0;                     # keep the line for left.txt
}

NR != FNR {
    split($0, d, ":");
    r = substr($0, index($0, ":") + 1);     # right side of the 1st ":"
    if (a[d[1]] != "") {
        print a[d[1]] s1 r s1 b[d[1]] > "result.txt";
        c[d[1]] = "";               # drop from the list for left.txt
    }
}

END {
    for (var in c) {
        if (c[var] != "") {
            print c[var] > "left.txt"
        }
    }
}' 1.txt 2.txt
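One caveat with this version, echoed in the comments below: `a[$2] = $1` keeps a single entry per hash, so duplicate values in the 2nd column of 1.txt overwrite each other and only the last such row can reach result.txt. A minimal illustration with made-up data:

```shell
# Two made-up rows share the hash "h1"; the second assignment to a["h1"]
# overwrites the first, so only the last email survives.
out=$(printf 'x@a.com||o||h1||o||A\ny@b.com||o||h1||o||B\n' |
      awk -F'[|][|]o[|][|]' '{ a[$2] = $1 } END { print a["h1"] }')
echo "$out"
```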

The next version changes the order of file reading to reduce memory consumption: 2.txt is loaded into an associative array and 1.txt is streamed through it:

awk -F'[|][|]o[|][|]' -v s1="||o||" '

NR == FNR {
    split($0, a, ":");
    r = substr($0, index($0, ":") + 1); # right side of the 1st ":"
    map[a[1]] = r;
}

NR != FNR {
    if (map[$2] != "") {
        print $1 s1 map[$2] s1 $3 s1 $4 s1 $5 > "result.txt";
    } else {
        print $0 > "left.txt"
    }
}' 2.txt 1.txt
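A quick end-to-end check of this version on the sample data from the question (recreating the sample files in the current directory is an assumption made for the demo):

```shell
# Recreate the sample files from the question.
cat > 2.txt <<'EOF'
bf6265003ae067b19b88fa4359d5c392:hyworebu:@
14f87ec1e760d16c0380c74ec7678b04:sujycugu
EOF
cat > 1.txt <<'EOF'
aidagolf6@gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bf6265003ae067b19b88fa4359d5c392||o||Aida||o||Aida||o||Garic Gara
aidagolfa@hotmail.com||o||14f87ec1e760d16c0380c74ec7678b04||o||Aida||o||Aida||o||Rodriguez Puerto
EOF
rm -f result.txt left.txt
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
    split($0, a, ":")
    r = substr($0, index($0, ":") + 1)  # right side of the 1st ":"
    map[a[1]] = r
}
NR != FNR {
    if (map[$2] != "") {
        print $1 s1 map[$2] s1 $3 s1 $4 s1 $5 > "result.txt"
    } else {
        print $0 > "left.txt"
    }
}' 2.txt 1.txt
cat result.txt left.txt
```

The matched rows land in result.txt with the hash replaced by the password; the single unmatched row lands in left.txt.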

and the final version makes use of a file-based database (DB_File), which minimizes RAM consumption, although I'm not sure if Perl is acceptable on your system.

perl -e '
use DB_File;
use Fcntl;                      # supplies the O_CREAT / O_RDWR constants

$file1 = "1.txt";
$file2 = "2.txt";
$result = "result.txt";
$left = "left.txt";

my $dbfile = "tmp.db";
tie(%db, "DB_File", $dbfile, O_CREAT|O_RDWR, 0644) or die "$dbfile: $!";

open(FH, $file2) or die "$file2: $!";
while (<FH>) {
    chomp;                      # chomp, not chop: strip only a trailing newline
    @_ = split(/:/, $_, 2);
    $db{$_[0]} = $_[1];
}
close FH;
open(FH, $file1) or die "$file1: $!";
open(RESULT, "> $result") or die "$result: $!";
open(LEFT, "> $left") or die "$left: $!";

while (<FH>) {
    @_ = split(/\|\|o\|\|/, $_);
    if (defined $db{$_[1]}) {
        $_[1] = $db{$_[1]};
        print RESULT join("||o||", @_);
    } else {
        print LEFT $_;
    }
}
close FH;
untie %db;
'
rm tmp.db
tshiono
  • It is working correctly for smaller data. Let me test it on bigger files. I will upvote it then. Thanks for the help. – Bhawan Feb 28 '18 at 06:38
  • for bigger files, I am getting more lines in result.txt than 1.txt – Bhawan Feb 28 '18 at 07:26
  • Can you show me an example file which reproduces the phenomenon? – tshiono Feb 28 '18 at 07:32
  • They are very big files, one is 13 GB and another is 3 GB. – Bhawan Feb 28 '18 at 09:05
  • Uh-oh, the sizes are much larger than I've expected. I cannot imagine the cause of the malfunction as of now, but it may not be a good idea to hold strings in associative arrays. Let me sleep over it. – tshiono Feb 28 '18 at 09:26
  • Although I've not found the real cause of the malfunction, I've added two options to reduce memory consumption. Greatly appreciated if you can try them. – tshiono Mar 01 '18 at 02:41
  • the only problem is if I have duplicate values (in the 2nd column) in 1.txt, only the last one is written. It should print all of them in result.txt – Bhawan Mar 03 '18 at 06:34
  • You know I have proposed three versions as answers. Which version are you talking about? Do you have also duplicate values in 1st columns in 2.txt? If not, the 2nd script and the 3rd script would hopefully work as you expect. – tshiono Mar 04 '18 at 23:43