how to find out common columns and its records from two files using awk

Question

I have two files:

File 1:

id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio

File 2(only headers):

id|name|country

Now, I want an output like:

OUTPUT:

id|name|country
1|abc|xyz
2|asd|uio

Basically, I have a user record file(file1) and a header file(file2).Now, I want to extract only those records from (file1) whose columns match with that in the header file.

I want to do this using awk or bash.

I tried using:

awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt

and have no idea what to do next.

Thank You

`a[(FNR"")]` = simply `a[FNR]`. Not sure what you were thinking... — Ed Morton, May 16 '18 at 18:02

score 2 · Accepted Answer · edited May 16 '18 at 17:49

2

Following awk may help you on same.

awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2  File1

Adding a non-one liner form of solution too now.

awk -F"|" '
FNR==NR{
   for(i=1;i<=NF;i++){
     a[$i]};
   next}
FNR==1 && FNR!=NR{
   for(j=1;j<=NF;j++){
     if($j in a){ b[++p]=j }}
}
{
   for(o=1;o<=p;o++){
     printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2  File1

Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:

BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}

edited May 16 '18 at 17:49

Ed Morton

188,023
17
78
185

answered May 16 '18 at 12:52

RavinderSingh13

130,504
14
57
93

1

You don't need the `&& FNR!=NR`. I wish you used a more normal white space/indenting and some more meaningful variable names. I didn't understand what yours was doing til after I wrote mine and then realized they were the same! FWIW I also hate seeing OFS set separately from FS when they MUST have the same value, and I hate seeing any variables (e.g. OFS) set AFTER the script in which those variables are used unless there's a very good reason for it (e.g. to avoid backslash expansion) - both things just obfuscate the code and make it harder to maintain. Just do `BEGIN{FS=OFS="|"}` up front. – Ed Morton May 16 '18 at 16:36
1

@EdMorton, thank you Ed sir, I will try to improve the things you have mentioned here, great to hear your feed back always sir. – RavinderSingh13 May 16 '18 at 17:51
2

if in doubt about how to format your code, just run a C beautifier (e.g. see https://codebeautify.org/c-formatter-beautifier) on your script (the part between the `'`s). Awk is similar-enough to C that that'll almost always work to output a clear, commonly-formatted program. There's command-line alternatives like `indent` and `cb` too. – Ed Morton May 16 '18 at 17:53
1

@EdMorton, that's great Ed sir, I could learn from it now :) you ROCK. – RavinderSingh13 May 16 '18 at 17:55
2

All those brackets make my head go nil. `(funcall (lambda () "Hello World")) ` – James Brown May 16 '18 at 18:00

score 1 · Answer 2 · answered May 16 '18 at 15:15

with (lot's of) unix pipes as Doug McIlroy intended...

$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file

id|name|country
1|abc|xyz
2|asd|uio

score 0 · Answer 3 · answered May 16 '18 at 13:12

Solution using bash>4:

IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'


# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
        for j in $(seq 0 $((${#headers1[@]}-1))); do
                if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
                        idx+=($j)
                        break
                fi
        done
done
# idx=(0 1 3) for example

# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }

# first line - output headers
join_by '|' "${headers2[@]}"

isfirst=true
while IFS='|' read -a vals; do
        # ignore first (header line)
        if $isfirst; then
                isfirst=false
                continue;
        fi;
        # filter from line only columns with idx indices
        tmp=()
        for i in ${idx[@]}; do 
             tmp+=("${vals[$i]}")
        done
        # join ouptut with '|'
        join_by '|' "${tmp[@]}"
done < $file1

Don't do this. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for some of the reasons. — Ed Morton, May 16 '18 at 16:06

Kusalananda · Answer 4 · 2018-05-16T18:24:45.810

0

This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.

It however does the output differently. Instead of constructing a string, it shifts the columns to the left if it does not want to include a particular field.

BEGIN       { FS = OFS = "|" }

# read headers from first file
NR == FNR   { for (i = 1; i <= NF; ++i) header[$i]; next }

# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1    {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}

{
    skip = 0
    pos  = 1
    for (i = 1; i <= NF; ++i)
        if (!select[i]) {          # we don't want this field
            ++skip
            $pos = $(pos + skip)   # shift fields left
        } else
            ++pos

    NF -= skip  # adjust number of fields
    print
}

Running this:

$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

edited May 16 '18 at 18:24

answered May 16 '18 at 13:43

Kusalananda

14,885
3
41
52

Decrementing NF is undefined behavior per POSIX so YMMV depending on what each awk implementer decided to do with that instruction. Ditto for `pos + ++skip` as the white space wont dictate the outcome. It's obviously inefficient to always loop through all fields and decide field by field to print it or not (and even more inefficient to recompile the record each time a field matches and then resplit the record at the end!) vs just deciding once which field numbers to print and then printing those for each line. – Ed Morton May 16 '18 at 16:43
@EdMorton Could you provide a reference to the undefinedness of modifying `NF`? It's not a reserved word in the grammar. The `++` issue is easily dealt with. The efficiency is another matter. – Kusalananda May 16 '18 at 16:50
@EdMorton I've been struggling to find a way to assemble output in `awk` in a way the _looks_ good. Having to use `out = (out ? out OFS field : field)` is so terribly ugly. And using `printf` in a similar way while testing whether OFS should be inserted or not, is also just very inelegant. I'm lacking Perl's `join()` :-) – Kusalananda May 16 '18 at 17:02
Wrt NF - it's only decrementing it that's undefined, incrementing it is defined. You could read the POSIX standard and note that what happens when you increment NF is defined but decrementing it isn't but otherwise idk how to provide a standards reference to something that simply doesn't exist in the standards. The gawk manual does say `CAUTION: Some versions of awk don’t rebuild $0 when NF is decremented.` so maybe that's adequate? – Ed Morton May 16 '18 at 17:15
`out = (out ? out OFS field : field)` - that should be `out = (out=="" ? "" : out OFS) field` to avoid having to specify `field` twice and so it behaves as intended even if/when `out` evaluates numerically to zero. It's an extremely common ternary expression used in many languages so I'm not seeing the ugliness. Yes, perl has a millions constructs to do trivial things in slightly briefer ways - that's one of the main problems with it. awk by comparison only has constructs to do things that would be difficult/lengthy to do otherwise which is why its such a small, simple, but powerful language. – Ed Morton May 16 '18 at 17:22
@EdMorton I'm still not 100% sure I buy the argument about decrementing `NF` is illegal. Sure, some versions of GNU `awk` may not support it (as per its manual, and this may possibly be a bug), but `mawk` explicitly says `Assignment to NF or to a field causes $0 to be reconstructed`. BSD `awk` supports it too. – Kusalananda May 16 '18 at 18:05
@EdMorton And the standard, as I read it, says that many things are undefined, but does not mention changing `NF` directly (incrementing it or decrementing it), but only gives an example with indirectly incrementing it through assigning to `$(NF+2)` as a way of saying something about what happens with the intervening fields . – Kusalananda May 16 '18 at 18:05
I didn't say decrementing NF is illegal, I said its effect is undefined by POSIX. The gawk manual is talking about other awk implementations, not other gawk versions. The standard DOES state what happens when you increment NF, does NOT state what will happen when you decrement it and here's some evidence on my Mac: `$ awk --version` outputs `awk version 20070501` which when run with `$ echo 'a b c' | awk '{NF--; print}'` outputs `a b c` while the same command with GNU awk `$ echo 'a b c' | gawk '{NF--; print}'` outputs `a b`. Both commands are legal and POSIX-compliant. – Ed Morton May 16 '18 at 18:11
@EdMorton Try `echo 'a b c' | awk '{$1=$1; NF--; print}'`. I won't argue more about this here. We've made our points. – Kusalananda May 16 '18 at 18:17
That is yet another example of BSD awk doing something unique in the face of undefined behavior and still being legal and POSIX-compliant despite being internally inconsistent. There's no argument - decrementing NF is simply undefined behavior so if you do it different awks will behave in different ways (in different contexts apparently!) as shown in the scripts we've both posted. – Ed Morton May 16 '18 at 18:21
@EdMorton I've edited the solution to show that I'm using `mawk`, whose manual explicitly allows for modifying `NF` directly. – Kusalananda May 16 '18 at 18:25

James Brown · Answer 5 · 2018-05-16T16:37:22.837

This one respects the order of columns in file1, changed the order:

$ cat file1
id|country|name

The awk:

$ awk '
BEGIN { FS=OFS="|" }
NR==1 {                                             # file1
    n=split($0,a)
    next
}
NR==2 {                                             # file2 header
    for(i=1;i<=NF;i++)
        b[$i]=i
} 
{                                                   # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd

(Another version using cut for outputing in revisions)

how to find out common columns and its records from two files using awk

5 Answers5