0

Good morning,

I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following: I have these two files:

head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095

and

head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196

I would like to use a bash script to remove all lines from the input file in which the BP column lies within the interval specified in the input file and in which there is matching of the CHR column with the first column of the interval file.

Here is the code I've been working with (although a simpler solution would be welcomed):

while read interval; do
    chr=$(echo $interval | awk '{print $1}')
    START=$(echo $interval | awk '{print $2}')
    STOP=$(echo $interval | awk '{print $3}')

    awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
    mv tmp <input file>
done <

My problem is that no lines are removed from the input file. Even if the command

awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l

returns >4000 lines, so the lines clearly are in the input file.

Thank you very much for your help.

Sigurgeir
  • 365
  • 2
  • 12
  • I think you would be quicker and easier doing the whole lot in `awk` though I cannot tell from your description what it is that you are trying to do. Have a look at my answer here to see how to read 2 files with `awk` http://stackoverflow.com/a/22416432/2836621 – Mark Setchell Jun 30 '15 at 09:23
  • Thank you @TomFenech. I have tried the command `awk -v start="$START" -v stop="$STOP" '$2!=$chr {print} $2==$chr && ($3stop) {print}' < input_file > tmp` but without success. Would you be so kind as to explain further how I might implement your solution. @MarkSetchell, I would very much like to do the whole thing in awk. To clear up my description I would like a line removed if the CHR field matches $1 of the interval_file and the number in the BP column is between the numbers in $2 and $3 of the interval_file (not necessarily exactly matching the number) – Sigurgeir Jun 30 '15 at 10:12
  • You need to pass `$chr` in the same way :) – Tom Fenech Jun 30 '15 at 10:14
  • Ofcourse... I'm sorry. this worked like a charm. Thanks for your help – Sigurgeir Jun 30 '15 at 10:21

1 Answers1

0

You can try with instead of . The reason is that in you can create a hash of arrays to save the data of interval file, and extract it easier when processing your input, like:

perl -lane '
    $. == 1 && next;
    @F == 3 && do {
        push @{$h{$F[0]}}, [@F[1..2]];
        next;
    };
    @F == 7 && do {
        $ok = 1;
        if (exists $h{$F[1]}) {
            for (@{$h{$F[1]}}) {
                if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
                    $ok = 0;
                    last;
                }
            }
        }
        printf qq|%s\n|, $_ if $ok;
    };
' interval input

$. skips header of interval file. @F checks number of columns and the push creates the hash of arrays.

Your test data is not accurate because none line is filtered out, I changed it to:

SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095

So you can run it and get as result:

SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
Birei
  • 35,723
  • 2
  • 77
  • 82
  • This works very well. However, since I lack experience in using pearl I think I will go for the solution suggested in the comments to my question. I'm accepting this answer as the other solution was only posted as a comment. – Sigurgeir Jun 30 '15 at 10:24