I have a list of ranges, and I am trying to merge entries which lie within a given distance of each other.
In my data, the second column contains the lower bound of the range and the third column contains the upper bound. The logic is as follows: if the value in column 2 of a row is less than or equal to the value in column 3 of another row plus a given value, print the entry in column 2 of that other row and the entry in column 3 of the current row.
If the two ranges lie within the distance specified by the variable 'dist', they should be merged, else the rows should be printed as they are. If the row resulting from a merge lies within 'dist' of any other row, these should also be merged.
I would like this to be done only for rows in which the first column matches.
Input:
1 1 9
1 10 19
1 30 39
2 40 49
2 50 59
2 60 69
With dist=10, the desired output is:
1 1 19
1 30 39
2 40 69
Using awk, I've tried things along these lines:
awk -v dist=10 'NR=FNR { a[FNR] = $1; b[FNR] = $2; c[FNR] = $3; next; }
{
    for (i in a)
        if ($1 == a[i]) {
            for (i in c)
                if ($2 <= (c[i]+dist) {
                    print c[i], $2; }
                else {
                    print $1, $2; }
        }
}' infile
This returns syntax errors.
Any help appreciated!
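For reference, the merge rule described above can be sketched as a single pass in awk. This is only a sketch under stated assumptions, not a definitive solution: it assumes the input is already sorted by column 1 and then numerically by column 2 (as in the sample), and the function name merge_ranges is a hypothetical wrapper introduced here for illustration. It keeps one open range per key and extends it while the next lower bound is at most the open upper bound plus dist.

```shell
# Hypothetical helper (name is an assumption): merge_ranges DIST < sorted_input
# Assumes rows are sorted by column 1, then numerically by column 2.
merge_ranges() {
  awk -v dist="$1" '
    # Same key and lower bound within dist of the open upper bound:
    # extend the open range instead of printing.
    have && $1 == key && $2 <= hi + dist {
      if ($3 > hi) hi = $3
      next
    }
    # Otherwise print the open range (if any) and start a new one.
    {
      if (have) print key, lo, hi
      key = $1; lo = $2; hi = $3; have = 1
    }
    # Print the last open range.
    END { if (have) print key, lo, hi }
  '
}
```

With the sample input, merge_ranges 10 < infile should give the three desired rows. If the file is not sorted, something like sort -k1,1 -k2,2n infile | merge_ranges 10 is one way to prepare it. Because a merged range stays open until a row falls outside dist, chains of merges (such as the three rows with key 2) are handled without a second pass.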