Bash way to compare specific columns from two different files based on an index list

Question

I have two tab separated files of 1708 rows and different number of columns. My goal is to compare the value stored for all rows but only some specific columns. I have two lists containing the columns' number that I want to compare; here an example:

FileA ➝ col_ind_A = [12,20,24,55]
FileB ➝ col_ind_B = [14,28,35,79]

Here, column 12 of file A should be compared with column 14 of file B, 20 of fileA with 28 of fileB and so on. If file A has value 0 and file B doesn't, I want to modify file C (a copy of file A) in that position, and then store the value of file B (which is not 0):

# FileA                     #FileB                     #FileC
col11 col12 col13           col13 col14 col15          col11 col12 col13
  A     C     G               A      C     G            A      C     G
  G     0     T               G      T     T            G      T     T

I've seen that comparing columns is usually done with awk, but I'm quite new to bash and I don't know how to iterate over the rows of the two files while I iterate over the col_ind lists and indicate the column positions that I want to compare. Any suggestions are be welcome.

If it's of any help, I show an R code that does exactly this (it is just too slow):

for(i in 1:1708){ #rows
  for(j in 1:31946){ #cols 
    if( fileA[i, col_ind_A[j]] == '0'  && fileA[i, col_ind_A[j]] != fileB[ i, col_ind_B[j]]){ 
      fileC[i, col_ind_A[j]] <- fileB[i, col_ind_B[j]] # write value from fileB in file C 
    }
  }
}

Any help would be great. Thanks!!

[How to read multiple file in awk](https://stackoverflow.com/questions/14984340/using-awk-to-process-input-from-multiple-files) — Digvijay S, Mar 09 '20 at 08:53
[How to loop array in awk](https://stackoverflow.com/questions/3060600/awk-array-iteration-for-multi-dimensional-arrays) — Digvijay S, Mar 09 '20 at 08:54
Do you want to do this with `awk` because the _R code_ _is just too slow_? If so, you're misguided and would be better advised to ask for a better R code. — Armali, Mar 09 '20 at 08:54
Not bash, because it will slower than your R code most probably. — Jetchisel, Mar 09 '20 at 08:55
If bash is slower than R, is there a faster alternative? My program will be running for a week at this speed... — sarsanher, Mar 09 '20 at 09:22
I'd convert files to SQLite or other db. Index the relevant columns for speed, and ask the DB engine to process a query. Eventually re format the final output to be like the data files. — Léa Gris, Mar 09 '20 at 09:28
@Armali you're right, I just fixed this, thanks. As for a sample of my records, they are .ped files created using PLINK, they look exactly as in the example but with more rows and columns. I hope that helps. — sarsanher, Mar 09 '20 at 11:14
`20A with 28B` -what is 20A and 28B?? `I want to modify file C (a copy of file A) ` - wouldn't it be simpler to just generate fileC, if it's a copy? — KamilCuk, Mar 09 '20 at 12:21
@KamilCuk '''what is 20A and 28B''' means column 20 of fileA with column 28 of fileB (I just changed this in the description). I also changed the row number, it was 1708. Thanks! As for the fileC, it should be just like A BUT with the 0 values substituted by whatever fileB has it that position (the correspondent column). I just don't want to modify the original file, thats why I say it is a copy. — sarsanher, Mar 09 '20 at 14:00

score 2 · Answer 1 · answered Mar 09 '20 at 12:06

A perl script that does it:

#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw/say/;
use List::Util qw/pairs/;

# Adjust as needed.
my @columns = (12 => 14, 20 => 28, 24 => 35, 55 => 79);

my ($filea_name, $fileb_name) = @ARGV;

@columns = pairs map { $_ - 1 } @columns;
open my $filea, '<', $filea_name;
open my $fileb, '<', $fileb_name;
$, = " "; # Or "\t" or whatever to delimit output columns
while (my $linea = <$filea>) {
    my $lineb = <$fileb> or die "Files have different line counts\n";
    chomp $linea;
    chomp $lineb;
    my @acols = split ' ', $linea;
    my @bcols = split ' ', $lineb;
    for my $p (@columns) {
        if ($acols[$$p[0]] eq "0" && $bcols[$$p[1]] ne "0") {
            $acols[$$p[0]] = $bcols[$$p[1]];
        }
    }
    say @acols;
}

(Takes FileA and FileB as its command line arguments)

This looks really good, would it be possible to add the columns also as a txt file containing ‘’’12 => 14, 20 => 28, 24 => 35, 55 => 79, ...’’’ because the number of correspondences is over 30 thousand. Thanks! — sarsanher, Mar 09 '20 at 15:22

KamilCuk · Answer 2 · 2020-03-09T12:26:44.210

0

First join the files line by line, then just check the condition you want to check.

# recreate input
cat >file1 <<EOF
col11 col12 col13
A C G
G 0 T
EOF
cat >file2 <<EOF
col13 col14 col15
A C G
G T T
EOF

paste file1 file2 |
awk '{ if ($2 == 0 && $2 != $6) $2=$6; print $1, $2 ,$3}'

outputs:

col11 col12 col13
A C G
G T T

I guess from for(i in 1:1708){ #rows maybe you want to iterate over all columns, assuming there is the same number of columns in both files:

paste file1 file2 |
awk '{ 
   for (i=1;i<=NF/2;++i) if ($i == 0 && $i != $(i*2)) $i = $(i*2);
   for (i=1;i<=NF/2;++i) printf "%s%s", $i, i==NF/2?ORS:OFS; 
}'

edited Mar 09 '20 at 12:26

answered Mar 09 '20 at 12:21

KamilCuk

120,984
8
59
111

Regarding _assuming there is the same number of columns in both files_ - the querist wrote: _I have two tab separated files of 1708 rows and different number of columns._ – Armali Mar 09 '20 at 14:16
1

I think joining the files might not be the best idea since fileA has over 180000 columns and fileB has more than 30000, but thanks anyways, the awk bit was very useful :) – sarsanher Mar 09 '20 at 21:37

score 0 · Accepted Answer · answered Mar 09 '20 at 14:13

Since you asked for an awk solution, here's a straightforward one:

awk -v col_ind_A='12 20 24 55' -v col_ind_B='14 28 35 79' '
BEGIN { OFS="\t"
        split(col_ind_A, ciA)
        split(col_ind_B, ciB)
        while (getline <"FileB" > 0 && split($0, B) && getline <"FileA" > 0)
        {
            for (i in ciA) if ($ciA[i] == 0) $ciA[i] = B[ciB[i]]
            print >"FileC"
        }
      }'

But this won't be faster than the R code. An optimization step for the R code would probably be to eliminate the inner loop:

for (i in 1:nrow(FileA))
{
    j = which(FileA[i, col_ind_A] == 0)
    FileC[i, col_ind_A[j]] = FileB[i, col_ind_B[j]]
}

`getline <"FileB" > 0` should be `(getline <"FileB") > 0`. Ditto for FileA. — Ed Morton, Mar 09 '20 at 16:56
The R code worked perfectly and really fast. Thank you very much!! — sarsanher, Mar 09 '20 at 21:34

Bash way to compare specific columns from two different files based on an index list

3 Answers3