I have a list of ranges, and I am trying to merge entries which lie within a given distance of each other.

In my data, the second column contains the lower bound of the range and the third column contains the upper bound. The logic is as follows: if the value in column 2 of a given row is less than or equal to the value in column 3 of any other row plus a given value, print the entry in column 2 of that other row and the entry in column 3 of the given row.

If two ranges lie within the distance specified by the variable 'dist', they should be merged; otherwise the rows should be printed as they are. If the row resulting from a merge lies within 'dist' of any other row, these should also be merged.

I would like this to be done only for rows in which the first column matches.

Input:

1 1 9
1 10 19
1 30 39
2 40 49
2 50 59
2 60 69

If dist=10, the desired output is:

1 1 19
1 30 39
2 40 69

Using awk, I've tried things along these lines:

awk -v dist=10 'NR=FNR { a[FNR] = $1; b[FNR] = $2; c[FNR] = $3; next; }
    {
        for (i in a)
            if ($1 == a[i]) {
                    for (i in c)
                            if ($2 <= (c[i]+dist) {
                                    print c[i], $2; }
                            else {
                                    print $1, $2; }
            }
     }' infile

This returns syntax errors.

Any help appreciated!

AndreaT
Marla
  • How is this different from your last 2 questions, https://stackoverflow.com/questions/46524900/compare-different-columns-of-subsequent-rows-to-merge-ranges and https://stackoverflow.com/questions/46033946/how-to-compare-2-lists-of-ranges-in-bash? Also, rather than saying "This returns syntax errors." and leaving us to try to figure out what those might be, why not just include the syntax errors in your question? Finally, format your code, input, and output using the editor's `{}` button (or just indent each line 4 spaces manually) just like in your previous questions. – Ed Morton Oct 12 '17 at 14:55
  • It is different because the first question compared whether the values of 2 ranges overlap, the second question compared subsequent lines which lie within a given distance, and the present question compares all lines of ranges which lie within a given distance. – Marla Oct 13 '17 at 06:44
  • These are the errors that this version of the script returns: a syntax error at `if ($2 <= (c[i]+dist) {` and a syntax error at `else {`. – Marla Oct 13 '17 at 06:47
  • You really can't spot the syntax error in `if ($2 <= (c[i]+dist) {`? Think about it for a minute. – Ed Morton Oct 13 '17 at 13:53
  • A missing bracket? I changed it to `if ($2 <= (c[i]+dist)) {` and now the script is returning nothing. – Marla Oct 13 '17 at 14:22
  • Yes, a missing bracket. Fixing a syntax error doesn't mean your script will do what you want, but it's a start and now you have a different question to ask. Hint: `NR=FNR` != `NR==FNR`. Syntax matters; see the answers to your previous questions. – Ed Morton Oct 13 '17 at 14:24
  • It turns out there's a tool to merge windows in genomic data... yay! http://bedtools.readthedocs.io/en/latest/content/tools/merge.html – Marla Oct 19 '17 at 13:20
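For reference, the merge described in the question can be sketched as a sort followed by a single awk pass. This is only a sketch under a few assumptions that the original does not state: the function name `merge_ranges` is made up for illustration, the rows are whitespace-separated with numeric bounds, and reordering the rows is acceptable. Sorting by column 1 and then numerically by column 2 is what lets one pass catch ranges that lie within 'dist' of any other row with the same first column, not just the next one.

```shell
# Hypothetical helper (name is an assumption, not from the question).
# Usage: merge_ranges DIST < file   where each row is "key lower upper".
merge_ranges() {
    # Group rows by key, then order each group by lower bound.
    sort -k1,1 -k2,2n |
    awk -v dist="$1" '
        # Same key and lower bound within dist of the running upper bound:
        # extend the current merged range instead of printing.
        NR > 1 && $1 == key && $2 <= hi + dist {
            if ($3 > hi) hi = $3
            next
        }
        # Otherwise the previous merged range is complete, so print it.
        NR > 1 { print key, lo, hi }
        # Start a new range from the current row.
        { key = $1; lo = $2; hi = $3 }
        # Print the last pending range, if there was any input at all.
        END { if (NR > 0) print key, lo, hi }
    '
}

# Demo with the input from the question and dist=10.
merge_ranges 10 <<'EOF'
1 1 9
1 10 19
1 30 39
2 40 49
2 50 59
2 60 69
EOF
# prints:
# 1 1 19
# 1 30 39
# 2 40 69
```

With the question's file this would be called as `merge_ranges 10 < infile`. Note the `==` in the key comparison: as pointed out in the comments, `NR=FNR` is an assignment, not a test.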
