
Let me explain my problem using a dummy example. This is file A -

1 10 20 aa
2 30 40 bb
3 60 70 cc
. .. .. ..

and this is file B -

10 15 xx yy mm
21 29 mm nn ss
11 18 rr tt yy
69 90 qq ww ee
.. .. .. .. ..

I am trying to merge files A and B so that a row of A is joined with a row of B whenever their ranges overlap.

Overlap between A's row and B's row means, in my case: the range from $2 to $3 of A's row and the range from $1 to $2 of B's row have at least one value in common. In the example above, range(10,20) overlaps range(10,15), where range(10,20) = [10,11,12,13,14,15,16,17,18,19] and range(10,15) = [10,11,12,13,14] (upper bounds excluded, as in Python's range()).

So the expected output is -

1 10 20 aa 10 15 xx
1 10 20 aa 11 18 rr
3 60 70 cc 69 90 qq
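In other words, two rows should be joined exactly when their half-open ranges intersect. A minimal Python sketch of that condition (variable names `a_lo`, `a_hi`, `b_lo`, `b_hi` are just illustrative; upper bounds excluded, matching the range() semantics above):

```python
def overlaps(a_lo, a_hi, b_lo, b_hi):
    """True when [a_lo, a_hi) and [b_lo, b_hi) share at least one value."""
    return a_lo < b_hi and b_lo < a_hi

print(overlaps(10, 20, 10, 15))  # row "1 10 20 aa" vs "10 15 xx yy mm" -> True
print(overlaps(10, 20, 21, 29))  # row "1 10 20 aa" vs "21 29 mm nn ss" -> False
print(overlaps(60, 70, 69, 90))  # row "3 60 70 cc" vs "69 90 qq ww ee" -> True
```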

I tried this way (using Python and awk):

    import os

    for peak in State.peaks:
        i = peak[-1]
        peak = peak[:-1]
        a = peak[1]
        b = peak[2]
        d = State.delta
        # builds and runs one awk command per peak; the per-peak
        # os.system() call is what makes this slow
        c = ''' awk '{id=%d;delta=%d;a=%d;b=%d;x=%s;y=%s;if((x<=a&&y>a)||(x<=b&&y>b) || (x>a&&y<=b)) print id" "$7" "$3-$2} ' %s > %s ''' % (i, d, a, b, "$2-d", "$3+d", State.fourD, "file"+str(name))
        os.system(c)

I want to remove the Python part completely, as it is taking too much time.

ipj
  • [Stack Overflow](http://stackoverflow.com/tour) is a question and answer site for professional and enthusiast programmers. Please show your coding efforts. – Cyrus Mar 17 '18 at 16:06
  • Added code. Thanks. – ipj Mar 17 '18 at 16:14
  • Why not go the other way and write it all in Python? The part that is taking a long time is the `os.system` call for each line with geometric complexity and slow execution for each `n`. Python has richer set functionality and this is trivial in Python to write; awk can do it, but you will need to write the range overlap functionality from scratch. – dawg Mar 17 '18 at 16:26
  • Running `awk` for each `peak` is bound to be slow; running Python from within `awk` would also be horribly slow. Use pure Python or pure Awk; either can do the job very easily, and will be fast. – Jonathan Leffler Mar 17 '18 at 16:42
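As dawg and Jonathan Leffler suggest in the comments, the whole job is easy in pure Python. A sketch using the sample data (file names and the `merge_overlaps` helper are illustrative; half-open ranges assumed, as in the question's range() examples):

```python
def merge_overlaps(a_lines, b_lines):
    """Yield merged rows for every pair of A/B rows whose ranges intersect."""
    a_rows = [line.split() for line in a_lines if line.strip()]
    for b_line in b_lines:
        b = b_line.split()
        if not b:
            continue
        b_lo, b_hi = int(b[0]), int(b[1])
        for a in a_rows:
            a_lo, a_hi = int(a[1]), int(a[2])
            # half-open overlap test: [a_lo, a_hi) meets [b_lo, b_hi)
            if a_lo < b_hi and b_lo < a_hi:
                yield " ".join(a + b[:3])

a_data = ["1 10 20 aa", "2 30 40 bb", "3 60 70 cc"]
b_data = ["10 15 xx yy mm", "21 29 mm nn ss", "11 18 rr tt yy", "69 90 qq ww ee"]
for row in merge_overlaps(a_data, b_data):
    print(row)
```

With real files, `a_lines` and `b_lines` would simply be open file handles; the inner loop over all A rows is still O(n·m), like the awk answers below.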

2 Answers


awk to the rescue!

$ awk 'function intersect(x1,y1,x2,y2) 
         {return (x1>=x2 && x1<y2) || (x2>=x1 && x2<y1)}

       NR==FNR{lower[$0]=$2; upper[$0]=$3; next}

              {for(k in lower) 
                 if(intersect(lower[k],upper[k],$1,$2)) 
                    print k,$1,$2,$3}' file1 file2

Note that

  (x1>=x2 && x1<y2) || (x2>=x1 && x2<y1)
= [x1>=x2 || (x2>=x1 && x2<y1)]             && [x1<y2 || (x2>=x1 && x2<y1)]
= [(x1>=x2 || x2>=x1) && (x1>=x2 || x2<y1)] && [// symmetric 1~2]
= [True && x2<y1]                           && [True && x1<y2]     // since x1<y1 and x2<y2
= x2<y1 && x1<y2

which is equivalent to @Jonathan Leffler's condition; his form is more compact and more efficient, even though the equivalence is not obvious at first sight.
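As a sanity check, the equivalence can also be verified by brute force over all small valid ranges. A quick Python sketch (separate from the awk answer; function names are illustrative):

```python
def intersect_k(x1, y1, x2, y2):
    """karakfa's condition."""
    return (x1 >= x2 and x1 < y2) or (x2 >= x1 and x2 < y1)

def intersect_j(x1, y1, x2, y2):
    """The compact condition from the derivation above."""
    return x1 < y2 and x2 < y1

# exhaustively compare the two conditions over all valid half-open ranges
for x1 in range(6):
    for y1 in range(x1 + 1, 7):
        for x2 in range(6):
            for y2 in range(x2 + 1, 7):
                assert intersect_k(x1, y1, x2, y2) == intersect_j(x1, y1, x2, y2)
print("equivalent on all tested ranges")
```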

karakfa
  • The fact that range `[60..70)` overlaps with `[69..90)` shows that it is 'intersect' rather than 'contains' logic that is required. Your code incorrectly deems that an entry in file B containing `20 24 ii jj kk` overlaps with the entry in file A `1 10 20 aa`. – Jonathan Leffler Mar 17 '18 at 17:29
  • Oh yes, upper bound not included! – karakfa Mar 17 '18 at 17:46
  • As commented in Jonathan's reply, this approach is taking much time. Could you please give some faster solution, considering the huge size of the files. – ipj Mar 18 '18 at 08:20
  • a faster solution is possible if you can sort the files beforehand on lower/upper bounds. – karakfa Mar 18 '18 at 11:35
  • FYI: I ran a test over a large number of sample ranges (2400, to be precise), and compared the results of two Awk functions: (yours) `function intersect_k(x1,y1,x2,y2) {return (x1>=x2 && x1<y2) || (x2>=x1 && x2<y1)}` and (mine) `x1<y2 && x2<y1` – Jonathan Leffler Mar 19 '18 at 04:33
  • Yes, they are equivalent. Takes a minute to understand it though. – karakfa Mar 19 '18 at 14:22

This Awk script does the job:

NR == FNR { record[NR] = $0; lo[NR] = $2; hi[NR] = $3; nrecs = NR; next }
NR != FNR { # Overlap:  lo[A] < hi[B] && lo[B] < hi[A]
            for (i = 1; i <= nrecs; i++)
            {
                if (lo[i] < $2 && $1 < hi[i])
                    print record[i], $1, $2, $3
            }
          }

I saved it as range-merge-53.awk (53 is simply a random double-digit prime). I created file.A and file.B from your sample data, and ran:

$ awk -f range-merge-53.awk file.A file.B
1 10 20 aa 10 15 xx
1 10 20 aa 11 18 rr
3 60 70 cc 69 90 qq
$

The key is the 'overlap' condition, which must exclude the high value of each range — often denoted [lo..hi) for a half-open range.

It would be possible to omit either the next or the NR != FNR condition (but not both), and the code would still work.

See also Determine whether two date ranges overlap — the logic of ranges applies to dates and integers and floating point, etc.

Jonathan Leffler
  • It is working perfectly fine but taking a lot of time. These 2 files are very large. Could you please suggest some faster approach. – ipj Mar 18 '18 at 08:19
  • Is the data sorted? The code reads all of one file into memory; make sure it is the smaller one. How large is very large? Megabytes, Gigabytes, Terabytes, bigger? The slow part is cycling through every entry in the first file to compare the current record from the second file with it. If there's some way to cut down on the number of such comparisons (because you can tell there'll never be another match for the current record, because the data is sorted), then you can probably speed things up. – Jonathan Leffler Mar 18 '18 at 12:24
  • Sizes are in the order of Mb. Yes, it is possible to have the files sorted on the basis of the lower bound. What changes would be required in the command to accommodate sorted files? – ipj Mar 18 '18 at 18:05
  • Once you start optimizing, lots of nitty-gritty details start to matter. For example, your sample data shows 2-digit values, and 2-letter codes. For the MCVE, they're great — thank you! But for optimizing, it matters what the real values look like ("look like" is sufficient; we don't need to see actual values, just what they look like). Are the ranges fairly small, or can they be big? For example, could file A have an entry `13 123456 987654 peritoneum`? How many rows are in file A and in file B? How long is an average row in file A and in file B? How big can the range numbers get? – Jonathan Leffler Mar 19 '18 at 04:04
  • Also of relevance is "how compact are the ranges"? If the ranges are fairly small (say up to 3 digits difference) but the total value range is large (say 6-9 digits worth), then it is easier to eliminate the irrelevant than if any range could have up to 9 digits (out of 9 maximum). If your numbers are bigger still, please say so. The objective must be to avoid looking at data in memory — the cached data from reading file A in the answers. And if we know a lot more about the possible data controlling the ranges, we may be able to optimize better. – Jonathan Leffler Mar 19 '18 at 04:08
  • And, just in case you haven't already spotted it, this is probably material for a separate question — one that would refer to this question for the simplified version, but that would detail the sizes as requested so that people can help. If you ask a new question, please post a link to it in a comment to my answer (so I see it) and in a comment to karakfa's answer (so he sees it). – Jonathan Leffler Mar 19 '18 at 04:12
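Following up on the sorting idea from these comments: if both files can be sorted by their lower bound, a sweep can retire A rows that will never match again, instead of scanning all of A for every B row. A Python sketch under those assumptions (the `sweep_merge` helper and the pre-parsed `(lo, hi, text)` tuples are illustrative, not from either answer; half-open ranges assumed):

```python
import heapq

def sweep_merge(a_rows, b_rows):
    """a_rows and b_rows are (lo, hi, text) tuples; yields 'a_text b_text'
    for every overlapping pair, assuming half-open ranges [lo, hi)."""
    a_rows = sorted(a_rows)
    b_rows = sorted(b_rows)
    i = 0
    active = []  # min-heap of (hi, lo, text): A rows that may still match
    for b_lo, b_hi, b_text in b_rows:
        # admit every A row that starts before this B row ends
        while i < len(a_rows) and a_rows[i][0] < b_hi:
            lo, hi, text = a_rows[i]
            heapq.heappush(active, (hi, lo, text))
            i += 1
        # retire A rows ending at or before this B row starts; safe
        # because later B rows start no earlier (sorted by lo)
        while active and active[0][0] <= b_lo:
            heapq.heappop(active)
        for hi, lo, text in active:
            if lo < b_hi:  # hi > b_lo is already guaranteed by the heap
                yield f"{text} {b_text}"

a_rows = [(10, 20, "1 10 20 aa"), (30, 40, "2 30 40 bb"), (60, 70, "3 60 70 cc")]
b_rows = [(10, 15, "10 15 xx"), (21, 29, "21 29 mm"),
          (11, 18, "11 18 rr"), (69, 90, "69 90 qq")]
for row in sweep_merge(a_rows, b_rows):
    print(row)
```

The worst case is still quadratic when many ranges stay active at once, but with Mb-sized files of mostly short, sparse ranges this avoids touching the bulk of file A for each B row.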