I need to compare two input files (file1.csv and file2.csv) and store the results in a third file (file3.csv).

The conditions are as follows:

  1. If an entry from file1 has a matching entry in file2, store the full file2 line in file3
  2. If an entry is found only in file1, store it in file3 as-is

It looks like this:

input: file1.csv

"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"

input: file2.csv

"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"

results: file3.csv

"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"switch6"
"switch10"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"

I have tried lots of combinations using awk; however, I cannot construct the conditionals.

I was wondering if someone could help me construct the conditional(s) to build this results file3.csv.

Assistance is much appreciated.

rm-rf
  • Show us some code. – Allan Wind Feb 25 '22 at 08:33
  • Can your quoted fields ever contain `;`s or escaped `"`s, e.g. `"foo""bar;"`? Does the output order matter? You appear to be sorting the first string both alphabetically (so `router` comes before `switch`) AND numerically (so `6` comes before `10`) - is that really what you need? – Ed Morton Feb 25 '22 at 13:03
  • your file 3 CSV output is wrong, you must insert separators also for empty cells – aborruso Feb 26 '22 at 14:00
  • @aborruso while that's true for CSVs that conform to RFC4180 and makes sense in general, that's not the only CSV "standard" and in reality a CSV can contain whatever the tool you're using to parse it allows as long as it's character-separated, e.g. a CSV with different numbers of fields per line would be just fine if opened by MS-Excel. – Ed Morton Feb 27 '22 at 10:02

4 Answers


I would use GNU AWK for this task in the following way. Let file1.txt content be

"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"

and file2.txt content be

"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"

then

awk 'BEGIN{FS=";"}(NR==FNR){arr[$1]=$0}((NR!=FNR)&&($1 in arr)){arr[$1]=$0}END{for(i in arr){print arr[i]}}' file1.txt file2.txt

output

"switch6"
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch10"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"

Disclaimer: I assume that you are able to accept any order of lines in the output file. Explanation: I inform GNU AWK that ; is used as the field separator (FS). Then, when processing the first file (NR==FNR), I fill array arr, putting the whole current line ($0) under a key which is the content of the first field ($1). When processing subsequent files (NR!=FNR), if there is a value in array arr corresponding to the first field of the line, I update it to the whole current line. After all files are processed, I print all values from array arr using a for loop.

(tested in gawk 4.2.1)
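
For readability, here is the same program laid out as a commented multi-line script (behavior unchanged; run it as awk -f script.awk file1.txt file2.txt):

BEGIN { FS=";" }                           # fields are separated by ;
NR==FNR { arr[$1] = $0 }                   # 1st file: store each line keyed on field 1
NR!=FNR && ($1 in arr) { arr[$1] = $0 }    # later files: replace matched entries with the full line
END { for (i in arr) print arr[i] }        # print everything; order unspecified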

Daweo
  • Suggest using the filenames the OP provided (file1.csv, file2.csv). Use -F\; instead of setting FS. – Allan Wind Feb 25 '22 at 08:43
  • Instead of testing a condition and then testing the negation of the same condition, consider writing `NR==FNR{foo; next} {bar}` instead of `NR==FNR{foo} NR!=FNR{bar}` – Ed Morton Feb 25 '22 at 13:21

Similar idea to @Daweo's, but:

  1. use the file names provided
  2. hard-code the file names in the script instead of taking them as arguments, as they are not interchangeable
  3. yield the result in the requested order

You would save the script as task, chmod 755 task then run it as ./task > file3.csv:

#!/usr/bin/env -S awk -F; -f

BEGIN {
    # build a lookup table keyed on the first field of file2.csv
    while ( (getline <"file2.csv") > 0 ) {
        lookup[$1] = $0
    }
    # read file1.csv in sorted order; print the matching file2 line
    # if there is one, otherwise the bare entry
    while ( ("sort -k1.1,1.7 -k1.8n file1.csv" | getline) > 0 ) {
        print ($0 in lookup) ? lookup[$0] : $0
    }
}

Note: the sort used here is fragile as it relies on a fixed-size prefix (router or switch) to identify the two parts of the compound key.
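
If GNU sort is available, a lighter-weight alternative is version sort (-V), which orders embedded numbers numerically without relying on fixed character offsets; the second while loop would then read (a sketch, worth verifying against your real device names):

    while ( ("sort -V file1.csv" | getline) > 0 ) {
        print ($0 in lookup) ? lookup[$0] : $0
    }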

Using gawk's PROCINFO["sorted_in"] feature to return keys in a specified order, together with a custom compare function str_num_cmp(), makes it a bit more robust:

#!/usr/bin/env -S awk -F; -f

# split a key like "router51" into a[1] ("router") and a[2] ("51")
# using gawk's 3-argument match()
function str_num_to_a(i, a) {
    match(i, /"([^0-9]*)([0-9]+)"/, a)
}

# a1/a2 are specified as arguments to make them local variables
function str_num_cmp(i1, v1, i2, v2, a1, a2) {
    str_num_to_a(i1, a1)
    str_num_to_a(i2, a2)
    if(a1[1] < a2[1])
        return -1
    if(a1[1] == a2[1]) {
        if(a1[2] < a2[2])
            return -1
        if(a1[2] == a2[2])
            return 0
    }
    return 1
}

BEGIN {
    while ( (getline <"file1.csv") > 0 ) {
        keys[$1] = ""
    }
    while ( (getline <"file2.csv") > 0 ) {
        lookup[$1] = $0
    }
    PROCINFO["sorted_in"] = "str_num_cmp"
    for (k in keys) {
        print (k in lookup) ? lookup[k] : k
    }
}
Allan Wind
  • That syntax for calling `getline` would spin off into an infinite loop if it encountered a problem, since a failure return is a negative number which is treated as a true condition, so the loop would just spin, failing on each call. See http://awk.freeshell.org/AllAboutGetline. You don't need `getline` here at all though, just let awk read the file in its normal way, e.g. since you're using GNU awk, in `ARGIND==1` and `ARGIND==2` blocks. – Ed Morton Feb 25 '22 at 12:55
  • Thanks @EdMorton. There are two input files and they are treated differently, so I don't think passing them in normally is the way to go. You could process file2.csv "normally" but that's kinda odd too. Updated to avoid the infinite loop, but there are other error cases not handled, like the syntax not matching, values being different than the example, etc. – Allan Wind Feb 25 '22 at 20:09
  • You're welcome. I don't see any reason not to just let awk read both files as part of its normal processing unless there's a performance issue with testing `NR==FNR` or equivalent for each input line, but we can agree to disagree. – Ed Morton Feb 25 '22 at 21:04
  • Because the two files are not interchangeable. `task file1.csv file2.csv` is not the same as `task file2.csv file1.csv`. I would not be able to remember the correct usage. On the flip side, if you could specify as options that would be okay... `task --lookup file2.csv file1.csv`. Maybe it's just me of course. – Allan Wind Feb 26 '22 at 07:46
  • If you're concerned about getting the file names in the wrong order and are OK with hard-coding the input file names then you can always write `FILENAME=="file1.csv"{keys[$1]} FILENAME=="file2.csv"{lookup[$1] = $0}` or `BEGIN{ARGV[ARGC++]="file1.csv"; ARGV[ARGC++]="file2.csv"} NR==FNR{keys[$1]; next} {lookup[$1] = $0}` but the best thing to do given that concern and awk in a shell script is just not use a shebang (see https://stackoverflow.com/a/61002754/1745001 btw) and then write `awk 'NR==FNR{keys[$1]; next} {lookup[$1] = $0} ...' file1.csv file2.csv`. But, again, we can agree to disagree. – Ed Morton Feb 26 '22 at 13:07
  • @EdMorton It's always a good day when I learn something. I have lots of good days :-). – Allan Wind Feb 26 '22 at 20:53
  • You'll really enjoy this then :-) : if you don't make `a1` and `a2` function-local variables in `str_num_cmp()` by declaring it as `function str_num_cmp(i1, v1, i2, v2, a1, a2)` then they are global variables and the way the sorting algorithm (quick sort) works is recursive descent so you could end up getting some very weird results or cryptic errors as the values in a1 and a2 get clobbered during the recursion. – Ed Morton Feb 26 '22 at 21:46
  • Indeed, I did :-) The above version and your suggested change behave identically with the supplied test data for me. Could be luck of course. I was mainly curious if awk would ignore the 2 extra unexpected arguments (it does, and I'm somewhat conflicted about that). – Allan Wind Feb 27 '22 at 05:32
  • Adding unused arguments to a function is the one and only way to create function-local arguments. Awk doesn't ignore them, it declares them as such. Try adding `for (x in a1) print x` as the last line of your `BEGIN` section without my suggested changes and I expect you'll get an error message like `awk: cmd. line:15: fatal: internal error line 1349, file: /home/corinna/tmp/gawk-5.1.1/gawk-5.1.1-1.x86_64/src/gawk-5.1.1/eval.c.` Now add my suggested function arg changes to `str_num_cmp()` and the error will go away and your code will work. – Ed Morton Feb 27 '22 at 09:46
  • The point is bad things are happening internally with your code as written and just a tweak here and there would cause those to manifest externally. FYI by convention we add multiple white space (often 1 or 2 tabs) between the real function args and the function-local declarations, e.g. function `str_num_cmp(i1, v1, i2, v2, a1, a2)`, so it's clear when reading the code what those extra args are for. All variables used in a function should be declared as such unless you actually NEED them to be global so you don't get tripped up by same-name variable clashes inside and outside of functions – Ed Morton Feb 27 '22 at 09:46
  • See the `prt()` function declaration in the second script in [my answer](https://stackoverflow.com/a/71266556/1745001) for an example of how to declare all variables that should be function-local as such. – Ed Morton Feb 27 '22 at 09:52
  • @EdMorton Confirmed. Updating answer as suggested. – Allan Wind Feb 28 '22 at 08:11

Using any awk in any shell on every Unix box and assuming that your quoted fields don't contain ;s and that you don't care about the output order:

$ cat tst.awk
BEGIN { FS=";" }
NR==FNR {               # 1st file: remember every entry (value unused)
    first[$1]
    next
}
{                       # 2nd file: print the full line, mark the entry as seen
    print
    delete first[$1]
}
END {                   # whatever remains was only in file1
    for ( i in first ) {
        print i
    }
}

$ awk -f tst.awk file1.csv file2.csv
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"switch6"
"router12"
"switch10"

If you do care about the output order you could use the decorate/sort/undecorate idiom with any awk+sort+cut:

$ cat tst.awk
BEGIN { FS=OFS=";" }
NR==FNR {
    first[$1]
    next
}
{
    prt($0)
    delete first[$1]
}
END {
    for ( i in first ) {
        prt(i)
    }
}

# decorate: emit the alphabetic and numeric parts of field 1 as leading
# sort keys (arr, alpha, numeric after the gap are function-local)
function prt(str,       arr, alpha, numeric) {
    split(str,arr)
    alpha = numeric = arr[1]
    sub(/[0-9].*/,"",alpha)      # drop everything from the first digit on
    gsub(/[^0-9]/,"",numeric)    # keep only the digits
    print alpha, numeric, str
}
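
For the sample data, here are two of the decorated lines the awk script emits before the sort and cut stages (the first file2.csv line and one END-loop line), with the alphabetic and numeric sort keys prefixed:

"router;51;"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch;6;"switch6"

sort -t';' -k1,1 -k2,2n then orders on those two keys and cut -d';' -f3- strips them back off: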

$ awk -f tst.awk file1.csv file2.csv | sort -t';' -k1,1 -k2,2n | cut -d';' -f3-
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"switch6"
"switch10"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
Ed Morton

If awk is not mandatory, you can use the great and simple Miller.

Running

mlr --csv --fs ";" -N join --ul -j 1 -f file1.csv then unsparsify then sort -f 1 file2.csv

you have

router12;;
router44;DatacenterC - cab2;Test - Tenant2
router51;DatacenterA - cab1;Prod - Tenant12
switch10;;
switch2;DatacenterA - cab3;Dev - Tenant5
switch33;DatacenterB - cab14;Prod - Tenant4
switch6;;

Some notes:

  • --csv --fs ";" to set that the input and the output are CSV in which the field separator is ;
  • -N to set that the input and the output do not have a header row;
  • join --ul -j 1 to run a left outer join on the first column (--ul keeps the unpaired records from the left file)
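
Note that Miller strips the surrounding double quotes on output. If you want every field re-quoted as in the question's sample, the CSV writer has a --quote-all option (assuming a reasonably recent mlr):

mlr --csv --fs ";" --quote-all -N join --ul -j 1 -f file1.csv then unsparsify then sort -f 1 file2.csv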
aborruso