Merge two files under a certain condition

Question

File 1 (big file):

A B C TIMESTAMP  
4 4 4 12344653380  
5 5 5 12344653401  
6 6 6 12344653411  
7 7 7 12344653433  
8 8 8 12344653460  
9 9 9 12344653466

File 2 (small file)(samples from every 30 sec):

D E F TIMESTAMP  
1 1 1 12344653400  
2 2 2 12344653430  
3 3 3 12344653460

File 3:

merge every line where timestamp is between the interval of TIMESTAMP on file 2

A B C TIMESTAMP   D E F   
5 5 5 12344653401 1 1 1  
6 6 6 12344653411 1 1 1  
7 7 7 12344653433 2 2 2  
8 8 8 12344653460 3 3 3

I have found lots of answers like these How to merge two files using AWK? but they always focus on a match.

What does _where timestamp is between the interval of TIMESTAMP on file 2_ mean? — James Brown, Dec 22 '16 at 07:14
What I am trying to do here is to get all tuples between the range of TIMESTAMP in file2 and merge with file 1, but the condition is that if I have a tuple in file 1 that has a TIMESTAMP <= the TIMESTAMP of file2 I will leave it out of file3. If the next tuple of file1 is >= the first TIMESTAMP of file2 and < the next one so it will be merged and written in file 3 and so on. — gabriel32, Dec 24 '16 at 18:54

score 3 · Answer 1 · edited Jan 17 '17 at 18:55

Could you please try following and let me know if this helps you.

awk 'FNR==NR && FNR>1{MIN=MIN>$NF?NF:(MIN?MIN:$NF);MAX=MAX>$NF?MAX:$NF;next} ($NF>=MIN && $NF<=MAX)' file2 file1

You could take above's output into file3.

Explanation of awk command above:

awk 'FNR==NR && FNR>1{

FNR and NR are 2 awk's built-in variables which will let us know the number of lines for any file, so the NR variable's value will be increasing until the last file being read. As we know, awk can read multiple files at a time, but unlike NR, FNR's value will be reset at the beginning of each new file being read. I give the condition FNR==NR which means this condition will only be TRUE when first file is being read. Additionally, I am using FNR>1 to make sure file 2's first line shouldn't be read -- as it has headers in it instead of timings. So, when both of these are TRUE the next actions should take place.

MIN=MIN>$NF?NF:(MIN?MIN:$NF);

Create a variable named MIN which has a condition if MIN's value is greater than $NF (which is last field in awk, where $ refers to the field's value and NF refers to the number of fields). If this condition is TRUE then the action after wild character ? will be done, which is changing MIN's value to $NF's. If that condition is NOT TRUE then the action/condition after : will be executed. So, here again there is one condition which checks if MIN is NULL then keeps it's value to $NF else keep MIN value as it is.

MAX=MAX>$NF?MAX:$NF;

Creating a variable named MAX, and add a condition to check if MAX's value is greater than $NF's value, if this is TRUE then action after ?, which is to keep MAX's value as it is, will be executed. If FALSE then the action after :, which is to change it's value to $NF, will be executed.

next}

Using awk's built in keyword next here which means leave all next statements further so awk's cursor will move to the beginning of the statements.

($NF>=MIN && $NF<=MAX)

Now we need to check a condition (which will be only executed when first file named file2 is being read completely), as described above the next statement, this will not allow it to execute this condition. Check here if $NF's value is greater or equal to variable MIN's value and less or equal to MAX's value then print the current line, though there is nothing written to print here. awk works on condition/action, so if a condition is TRUE, then certain actions should be placed. No action is described here, so the default action of printing the current line will happen.

' file2 file1

Mentioning the Input_files named file2 and file1 here.

Welcome to Stack Overflow! To turn this into a great answer, please explain what it's doing too, so that people can learn what's going on, not just a copy/paste solution which works in one case. — clinton3141, Dec 21 '16 at 19:37
Hello iblamefish, Thank you. No sir, this is not a copy/paste answer, I am trying to put explanation here it is telling less chars are only allowed let me try with posting a answer now, apologies for not posting it before. — RavinderSingh13, Dec 21 '16 at 19:54
Thanks a lot for your answer but unfortunately it did not work out for what I need. — gabriel32, Dec 24 '16 at 18:55
@Sean Bright Did you write the entire explanation? If so, this should probably be community wiki... — Questionmark, Jan 17 '17 at 15:14
@SeanBright Oh, that makes sense... I can't see deleted answers on this site. — Questionmark, Jan 17 '17 at 21:27

score 0 · Answer 2 · answered Dec 21 '16 at 19:16

Try this:

read file 2 into an array (assuming the file has data sorted by timestamp, if not sort the array)
read file 1 (pre-sort it by timestamp if required) line by line; for each line, do binary search on the array to find the row that falls in the interval, merge the found line with the line from file 1

Merge two files under a certain condition

2 Answers2