
I have a name.txt file of one column, e.g.

A
B
C
D
E
F

Then I have many files, e.g. x.txt, y.txt and z.txt

x.txt has

A 1
C 3
D 2

y.txt has

A 1
B 4
E 3

z.txt has

B 2
D 2
F 1

The desired output is (filling in 0 where there is no mapping)

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Is it possible to do this with bash (perhaps awk)?
Many thanks!!!


First edits - my tentative efforts
Since I am quite new to bash, it is hard for me to figure out a solution with awk. I'm more familiar with R, where this can be accomplished by

namematrix[namematrix[,1]==xmatrix[,1],]

All in all, I really appreciate the kind help below helping me learn more about awk and join!
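For readers coming from R like me, the filter above corresponds roughly to a classic two-file awk idiom (the sample files here mirror the question's data):

```shell
# Rough awk counterpart of the R filter above: keep the rows of
# name.txt whose first field appears in x.txt.
printf '%s\n' A B C D E F > name.txt     # sample data from the question
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
awk 'NR==FNR {seen[$1]; next} $1 in seen' x.txt name.txt
```

The NR==FNR guard is true only while the first file is being read, which is what lets one awk invocation treat two files differently.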


Second edits - a super-efficient approach figured out!

Inspired by some brilliant answers below, I have worked out a very computationally efficient approach, as follows. It may be helpful to others with similar questions, in particular those dealing with a very large number of very large files.

First, create a script join_awk.bash:

#!/bin/bash
join -o auto -e0 -a1 "$1" "$2" | awk '{print $2}'

For instance, running this script on name.txt and x.txt (after chmod +x join_awk.bash)

./join_awk.bash name.txt x.txt

would generate

1
0
3
2
0
0

Note that here I only keep the second column to save disk space, because in my dataset the first columns are very long names that would take tremendous disk space.

Then simply run

parallel ./join_awk.bash name.txt {} \> outdir/output.{} ::: {x,y,z}.txt

This is inspired by the brilliant answer below using GNU parallel and join. The difference is that that answer must pass -j1 to parallel because of its serial appending logic, which makes it not truly parallel, and the appending gets slower and slower as the output file grows. Here, in contrast, each file is processed independently in parallel, which can be extremely fast for a large number of large files on multiple CPUs.

Finally just merge all the single-column output files together by

cd outdir
paste output* > merged.txt

This final step is also very fast, since paste reads all of the input files in a single pass.
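Putting the steps above together, here is a self-contained sketch on the question's sample data; a plain for loop stands in for GNU parallel, and -o auto assumes GNU join:

```shell
#!/bin/bash
# End-to-end sketch of the pipeline described above.  join(1) needs its
# inputs sorted on the join field, which these samples already are.
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt

mkdir -p outdir
for f in x.txt y.txt z.txt; do          # sequential stand-in for parallel
    join -o auto -e0 -a1 name.txt "$f" | awk '{print $2}' > "outdir/output.$f"
done
paste outdir/output.* > outdir/merged.txt
```

After this run, outdir/merged.txt holds the 6x3 matrix of values (tab-separated, names dropped as described above).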

Elfxy

7 Answers


You may use this awk:

awk 'NF == 2 {
   map[FILENAME,$1] = $2
   next
}
{
   printf "%s", $1
   for (f=1; f<ARGC-1; ++f)
      printf "%s", OFS map[ARGV[f],$1]+0
   print ""
}' {x,y,z}.txt name.txt

Output:

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1
anubhava

Adding one more way of doing it. Could you please try the following, written and tested with the shown samples. IMHO it should work in any awk, though I only have GNU awk 3.1. The approach is simple and usual: build an array while reading the first (major) Input_file, then while reading each subsequent Input_file append the second field for every key that is found, and afterwards append a 0 for every key that was NOT found in that specific Input_file. Tested with the small given samples only.

awk '
function checkArray(array){
  for(i in array){
    if(!(i in found)){ array[i]=array[i] OFS "0" }
  }
}
FNR==NR{
  arr[$0]
  next
}
foundCheck && FNR==1{
  checkArray(arr)
  delete found
  foundCheck=""
}
{
  if($1 in arr){
    arr[$1]=(arr[$1] OFS $2)
    found[$1]
    foundCheck=1
    next
  }
}
END{
  checkArray(arr)
  for(key in arr){
    print key,arr[key]
  }
}
' name.txt x.txt y.txt z.txt

Explanation: Adding detailed explanation for above.

awk '                               ##Starting awk program from here.
function checkArray(array){         ##Creating a function named checkArray from here.
  for(i in array){                  ##Traversing through array here.
    if(!(i in found)){ array[i]=array[i] OFS "0" }   ##Checking condition if key is NOT in found then append a 0 in that specific value.
  }
}
FNR==NR{                            ##Checking condition if FNR==NR which will be TRUE when names.txt is being read.
  arr[$0]                           ##Creating array with name arr with index of current line.
  next                              ##next will skip all further statements from here.
}
foundCheck && FNR==1{               ##Checking condition if foundCheck is SET and this is first line of Input_file.
  checkArray(arr)                   ##Calling function checkArray by passing arr array name in it.
  delete found                      ##Deleting found array to get rid of previous values.
  foundCheck=""                     ##Nullifying foundCheck here.
}
{
  if($1 in arr){                    ##Checking condition if 1st field is present in arr.
    arr[$1]=(arr[$1] OFS $2)        ##Appending 2nd field value to arr with index of $1.
    found[$1]                       ##Adding 1st field to found as an index here.
    foundCheck=1                    ##Setting foundCheck here.
    next                            ##next will skip all further statements from here.
  }
}
END{                                ##Starting END block of this program from here.
  checkArray(arr)                   ##Calling function checkArray by passing arr array name in it.
  for(key in arr){                  ##Traversing through arr here.
    print key,arr[key]              ##Printing index and its value here.
  }
}
' name.txt x.txt y.txt z.txt        ##Mentioning Input_file names here.
RavinderSingh13

Yes, you can do it, and yes, awk is the tool. Using arrays together with the per-file line number (FNR, the file number of records) and the overall total (NR, records), you can read all letters from name.txt into the a[] array. Then, keeping track of the file number in the variable fno, you add all the values from x.txt; before processing the first line of the next file (y.txt), you loop over all letters seen in the last file and, for those not seen, place a 0; then you continue processing as normal. Repeat for each additional file.

Further line-by-line explanation is shown in the comments:

awk '
    FNR==NR {                           # first file
        a[$1] = ""                      # fill array with letters as index
        fno = 1                         # set file number counter
        next                            # get next record (line)
    }
    FNR == 1 { fno++ }                  # first line in file, increment file count
    fno > 2 && FNR == 1 {               # file no. 3+ (not run on x.txt)
        for (i in a)                    # loop over letters 
            if (!(i in seen))           # if not in seen array
                a[i] = a[i]" "0         # append 0
        delete seen                     # delete seen array
    }
    $1 in a {                           # if line begins with letter in array
        a[$1] = a[$1]" "$2              # append second field
        seen[$1]++                      # add letter to seen array
    }
END {
    for (i in a)                        # place zeros for last column
        if (!(i in seen))
            a[i] = a[i]" "0
    for (i in a)                        # print results
        print i a[i]
}' name.txt x.txt y.txt z.txt

Example Use/Output

Just copy the above and middle-mouse-paste it into an xterm whose current directory contains your files, and you will receive:

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Creating a Self-Contained Script

If you would like to create a script to run instead of pasting at the command line, simply include the interpreter as the first line, add the contents (without the surrounding single quotes), and make the file executable:

#!/usr/bin/awk -f

FNR==NR {                           # first file
    a[$1] = ""                      # fill array with letters as index
    fno = 1                         # set file number counter
    next                            # get next record (line)
}
FNR == 1 { fno++ }                  # first line in file, increment file count
fno > 2 && FNR == 1 {               # file no. 3+ (not run on x.txt)
    for (i in a)                    # loop over letters 
        if (!(i in seen))           # if not in seen array
            a[i] = a[i]" "0         # append 0
    delete seen                     # delete seen array
}
$1 in a {                           # if line begins with letter in array
    a[$1] = a[$1]" "$2              # append second field
    seen[$1]++                      # add letter to seen array
}
END {
    for (i in a)                    # place zeros for last column
        if (!(i in seen))
            a[i] = a[i]" "0
    for (i in a)                    # print results
        print i a[i]
}

awk will process the filenames given as arguments in the order given.

Example Use/Output

Using the script file (I put it in names.awk and then used chmod +x names.awk to make it executable), you would then do:

$ ./names.awk name.txt x.txt y.txt z.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Let me know if you have further questions.

David C. Rankin
  • Thank you so much for your effective solutions and patient explanations - It really helps me a lot as a beginner learning awk and bash! :) – Elfxy Nov 24 '20 at 20:44
  • You are very welcome. `awk` it truly the Swiss-Army Knife of text processing. The time you spend learning will be paid back 10-fold in time saved later. The [GNU Awk User's Guide](https://www.gnu.org/software/gawk/manual/html_node/index.html#SEC_Contents) is a great resource for learning. – David C. Rankin Nov 24 '20 at 20:48

Another approach with GNU awk

$ cat script.awk
NF == 1 {
    name[$1] = $1
    for (i = 1; i < ARGC - 1; i++) {
        name[$1] = name[$1] " 0"
    }
    next
}

{
    name[$1] = gensub(/ ./, " " $2, ARGIND - 1, name[$1])
}

END {
    for (k in name) {
        print name[k]
    }
}

Calling the script:

$ awk -f script.awk name.txt {x,y,z}.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

The output happens to show the same order as name.txt, but I don't think that will hold for all kinds of input.

Sundeep

This might work for you (GNU parallel and join):

cp name.txt out && t=$(mktemp) &&
parallel -j1 join -oauto -e0 -a1 out {} \> $t \&\& mv $t out ::: {x,y,z}.txt

Output will be in file out.
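As noted in the comment below, join requires its inputs sorted on the join field. A small pre-sort pass (my addition, not part of this answer) guards against unsorted data:

```shell
# join(1) requires inputs sorted on the join field; sorting each
# file in place beforehand makes the pipeline safe on unsorted data.
printf '%s\n' 'C 3' 'A 1' 'D 2' > x.txt   # sample file, deliberately unsorted
printf '%s\n' A B C D E F > name.txt
for f in name.txt x.txt; do
    sort -k1,1 -o "$f" "$f"               # in-place sort on column 1
done
cat x.txt
```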

potong
  • Very nice solution! extremely neat and effective! A prerequisite is that all the files should be sorted :) Thank you so much. – Elfxy Nov 24 '20 at 20:42

You can use join:

join -a1 -e0 -o '0,2.2' name.txt x.txt | join -a1 -e0 -o '0,1.2,2.2' - y.txt | join -a1 -e0 -o '0,1.2,1.3,2.2' - z.txt
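This assumes every file is sorted on the join field. For reference, a self-contained run of the same chain against the question's sample data:

```shell
#!/bin/bash
# Recreate the sample files from the question, then run the chained joins;
# each join folds one more file in, with -e0 filling the gaps with 0.
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt

join -a1 -e0 -o '0,2.2' name.txt x.txt \
  | join -a1 -e0 -o '0,1.2,2.2' - y.txt \
  | join -a1 -e0 -o '0,1.2,1.3,2.2' - z.txt
```

Note that each stage's -o list must name one more column than the last, so the chain has to be extended by hand for every additional file.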
Diego Torres Milano

With bash how about:

#!/bin/bash

declare -A hash                                 # use an associative array
for f in "x.txt" "y.txt" "z.txt"; do            # loop over these files
    while read -r key val; do                   # read key and val pairs
        hash[$f,$key]=$val                      # assign the hash to val
    done < "$f"
done

while read -r key; do
    echo -n "$key"                              # print the 1st column
    for f in "x.txt" "y.txt" "z.txt"; do        # loop over the filenames
        echo -n " ${hash[$f,$key]:-0}"          # print the associated value or "0" if undefined
    done
    echo                                        # put a newline
done < "name.txt"
tshiono
    That will take forever to run for reasonably large input files and has no benefit over an awk script that's briefer and will run in a fraction of the time. – Ed Morton Nov 24 '20 at 15:54