2

I have two tab separated files of 1708 rows and different number of columns. My goal is to compare the value stored for all rows but only some specific columns. I have two lists containing the columns' number that I want to compare; here an example:

  • FileA ➝ col_ind_A = [12,20,24,55]
  • FileB ➝ col_ind_B = [14,28,35,79]

Here, column 12 of file A should be compared with column 14 of file B, 20 of fileA with 28 of fileB and so on. If file A has value 0 and file B doesn't, I want to modify file C (a copy of file A) in that position, and then store the value of file B (which is not 0):

# FileA                     #FileB                     #FileC
col11 col12 col13           col13 col14 col15          col11 col12 col13
  A     C     G               A      C     G            A      C     G
  G     0     T               G      T     T            G      T     T 

I've seen that comparing columns is usually done with awk, but I'm quite new to bash and I don't know how to iterate over the rows of the two files while I iterate over the col_ind lists and indicate the column positions that I want to compare. Any suggestions are be welcome.

If it's of any help, I show an R code that does exactly this (it is just too slow):

for(i in 1:1708){ #rows
  for(j in 1:31946){ #cols 
    if( fileA[i, col_ind_A[j]] == '0'  && fileA[i, col_ind_A[j]] != fileB[ i, col_ind_B[j]]){ 
      fileC[i, col_ind_A[j]] <- fileB[i, col_ind_B[j]] # write value from fileB in file C 
    }
  }
}

Any help would be great. Thanks!!

sarsanher
  • 23
  • 3
  • 1
    Can you please share sample records of your input file. – Digvijay S Mar 09 '20 at 08:51
  • [How to read multiple file in awk](https://stackoverflow.com/questions/14984340/using-awk-to-process-input-from-multiple-files) – Digvijay S Mar 09 '20 at 08:53
  • [How to loop array in awk](https://stackoverflow.com/questions/3060600/awk-array-iteration-for-multi-dimensional-arrays) – Digvijay S Mar 09 '20 at 08:54
  • Do you want to do this with `awk` because the _R code_ _is just too slow_? If so, you're misguided and would be better advised to ask for a better R code. – Armali Mar 09 '20 at 08:54
  • Not bash, because it will slower than your R code most probably. – Jetchisel Mar 09 '20 at 08:55
  • If bash is slower than R, is there a faster alternative? My program will be running for a week at this speed... – sarsanher Mar 09 '20 at 09:22
  • 1
    I'd convert files to SQLite or other db. Index the relevant columns for speed, and ask the DB engine to process a query. Eventually re format the final output to be like the data files. – Léa Gris Mar 09 '20 at 09:28
  • 1
    @Armali you're right, I just fixed this, thanks. As for a sample of my records, they are .ped files created using PLINK, they look exactly as in the example but with more rows and columns. I hope that helps. – sarsanher Mar 09 '20 at 11:14
  • `20A with 28B` -what is 20A and 28B?? `I want to modify file C (a copy of file A) ` - wouldn't it be simpler to just generate fileC, if it's a copy? – KamilCuk Mar 09 '20 at 12:21
  • @KamilCuk '''what is 20A and 28B''' means column 20 of fileA with column 28 of fileB (I just changed this in the description). I also changed the row number, it was 1708. Thanks! As for the fileC, it should be just like A BUT with the 0 values substituted by whatever fileB has it that position (the correspondent column). I just don't want to modify the original file, thats why I say it is a copy. – sarsanher Mar 09 '20 at 14:00

3 Answers3

2

A perl script that does it:

#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw/say/;
use List::Util qw/pairs/;

# Adjust as needed.
my @columns = (12 => 14, 20 => 28, 24 => 35, 55 => 79);

my ($filea_name, $fileb_name) = @ARGV;

@columns = pairs map { $_ - 1 } @columns;
open my $filea, '<', $filea_name;
open my $fileb, '<', $fileb_name;
$, = " "; # Or "\t" or whatever to delimit output columns
while (my $linea = <$filea>) {
    my $lineb = <$fileb> or die "Files have different line counts\n";
    chomp $linea;
    chomp $lineb;
    my @acols = split ' ', $linea;
    my @bcols = split ' ', $lineb;
    for my $p (@columns) {
        if ($acols[$$p[0]] eq "0" && $bcols[$$p[1]] ne "0") {
            $acols[$$p[0]] = $bcols[$$p[1]];
        }
    }
    say @acols;
}

(Takes FileA and FileB as its command line arguments)

Shawn
  • 47,241
  • 3
  • 26
  • 60
  • 1
    This looks really good, would it be possible to add the columns also as a txt file containing ‘’’12 => 14, 20 => 28, 24 => 35, 55 => 79, ...’’’ because the number of correspondences is over 30 thousand. Thanks! – sarsanher Mar 09 '20 at 15:22
0

First join the files line by line, then just check the condition you want to check.

# recreate input
cat >file1 <<EOF
col11 col12 col13
A C G
G 0 T
EOF
cat >file2 <<EOF
col13 col14 col15
A C G
G T T
EOF

paste file1 file2 |
awk '{ if ($2 == 0 && $2 != $6) $2=$6; print $1, $2 ,$3}'

outputs:

col11 col12 col13
A C G
G T T

I guess from for(i in 1:1708){ #rows maybe you want to iterate over all columns, assuming there is the same number of columns in both files:

paste file1 file2 |
awk '{ 
   for (i=1;i<=NF/2;++i) if ($i == 0 && $i != $(i*2)) $i = $(i*2);
   for (i=1;i<=NF/2;++i) printf "%s%s", $i, i==NF/2?ORS:OFS; 
}'
KamilCuk
  • 120,984
  • 8
  • 59
  • 111
  • Regarding _assuming there is the same number of columns in both files_ - the querist wrote: _I have two tab separated files of 1708 rows and different number of columns._ – Armali Mar 09 '20 at 14:16
  • 1
    I think joining the files might not be the best idea since fileA has over 180000 columns and fileB has more than 30000, but thanks anyways, the awk bit was very useful :) – sarsanher Mar 09 '20 at 21:37
0

Since you asked for an awk solution, here's a straightforward one:

awk -v col_ind_A='12 20 24 55' -v col_ind_B='14 28 35 79' '
BEGIN { OFS="\t"
        split(col_ind_A, ciA)
        split(col_ind_B, ciB)
        while (getline <"FileB" > 0 && split($0, B) && getline <"FileA" > 0)
        {
            for (i in ciA) if ($ciA[i] == 0) $ciA[i] = B[ciB[i]]
            print >"FileC"
        }
      }'

But this won't be faster than the R code. An optimization step for the R code would probably be to eliminate the inner loop:

for (i in 1:nrow(FileA))
{
    j = which(FileA[i, col_ind_A] == 0)
    FileC[i, col_ind_A[j]] = FileB[i, col_ind_B[j]]
}
Armali
  • 18,255
  • 14
  • 57
  • 171