Combine the data of two files having same column using awk

Question

I have two files: file1 and file2. i want to combine the time between file1 and file2 with common column as 1 and 2. I need to create a separate file which contains the file1 data and file2 mapping time column. But for some entires(see BOLD TEXT) its matching the same line from file2. Whats wrong in my command? is there any way to get the expected output?

Note: there can be more than 2 entries as shown in the bold text. column1 and column2 entries can be same while only time changes.

file 1

9950 | p | 2021-02-27 14:16:21.905
9951 | q | 2021-02-27 14:16:28.845
9954 | r | 2021-02-27 14:17:21.884
9954 | r | 2021-02-27 14:19:46.775
9956 | y | 2021-02-27 14:20:20.830
9957 | z | 2021-02-27 14:20:27.325

file 2

9950 | p | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.956
9956 | y | 2021-02-27 14:20:21.018
9957 | z | 2021-02-27 14:20:27.512

command:

 awk -F'|' 'NR==FNR{a[$1,$2]=$3;next} a[$1,$2]{print $0"\t"a[$1,$2]}' file2 file1

Output:

9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:19:46.956
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
9956 | x | 2021-02-27 14:20:20.830 | 2021-02-27 14:20:21.018
9957 | y | 2021-02-27 14:20:27.325 | 2021-02-27 14:20:27.512

Expected output:

9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
9956 | x | 2021-02-27 14:20:20.830 | 2021-02-27 14:20:21.018
9957 | y | 2021-02-27 14:20:27.325 | 2021-02-27 14:20:27.512

Your key `$1,$2` is not unique. Only the _last_ value from file2 will be stored in the `a` array. — glenn jackman, Feb 27 '21 at 17:26

James Brown · Answer 1 · 2021-02-27T19:02:54.200

$ awk '
BEGIN {
    FS=" [|] "
    OFS=" | "
}
NR==FNR {
    a[$1,++c[$1]]=$0          # c[$1] is an instance counter array, 
    next                      # unique for each $1
}
{
    print a[$1,++d[$1]],$NF   # so is d[$1]
}' file1 file2

Output:

9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
...

Updated with request from comments:

$ awk '
BEGIN {
    FS=" [|] "
    OFS=" | "
}
NR==FNR {
    a[$1,++c[$1]]=$0          # c[$1] is an instance counter array, 
    next                      # unique for each $1
}
(($1,++d[$1]) in a) {         # only output if match in both files
    print a[$1,d[$1]],$NF     # so is d[$1]
}' file1 file2

Hii, thanks for the solution. but there one thing when there is no mapping in request lets say 9951 transaction doesn't exists in file1 but it exists in file2. then i want to drop that entry. how can i do that? — shreyash, Feb 27 '21 at 18:50

score 3 · Accepted Answer · answered Feb 27 '21 at 17:37

3

Similar to James Brown's answer, but only needing to keep one "count" array, and using $1 plus $2 for the key

gawk -F '|' '
    FNR == 1 {delete count}
    {key = $1 SUBSEP $2 SUBSEP (++count[$1,$2])}
    FILENAME == ARGV[1] {time[key] = $3; next}
    key in time {print $0, FS, time[key]}
' file2 file1

answered Feb 27 '21 at 17:37

glenn jackman

238,783
38
220
352

Hey @glenn. Thanks for your answer. Its working perfectly in my scenario. Can you plase explain me the above gawk command. – shreyash Mar 01 '21 at 08:02

Ed Morton · Answer 3 · 2021-02-28T20:31:31.793

If this isn't all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for:

$ paste file1 file2 | awk '{$7=$8=$9=""; $0=$0; $1=$1}1'
9950 | p | 2021-02-27 14:16:21.905 | 2021-02-27 14:16:22.086
9951 | q | 2021-02-27 14:16:28.845 | 2021-02-27 14:16:29.033
9954 | r | 2021-02-27 14:17:21.884 | 2021-02-27 14:17:22.072
9954 | r | 2021-02-27 14:19:46.775 | 2021-02-27 14:19:46.956
9956 | y | 2021-02-27 14:20:20.830 | 2021-02-27 14:20:21.018
9957 | z | 2021-02-27 14:20:27.325 | 2021-02-27 14:20:27.512

In the above $7=$8=$9="" set those fields to NULL strings but does not change how many fields are in $0, then $0=$0 causes awk to resplit $0 thereby removing those as individual fields but not altering the white space around where they existed, then $1=$1 causes awk to reconstruct $0 from its fields replacing every string that matches FS (i.e. contiguous white space) with the string that matches OFS (i.e. a single blank char). For example (with | tr ' ' '-' added to make the spaces more clearly visible):

$ echo 'a b c' | awk '{print NF, $0}' | tr ' ' '-'
3-a-b-c
$ echo 'a b c' | awk '{$2=""; print NF, $0}' | tr ' ' '-'
3-a--c
$ echo 'a b c' | awk '{$2=""; $0=$0; print NF, $0}' | tr ' ' '-'
2-a--c
$ echo 'a b c' | awk '{$2=""; $0=$0; $1=$1; print NF, $0}' | tr ' ' '-'
2-a-c

Can you talk about how using _both_ `$0=$0` and `$1=$1` works to remove extraneous spaces? — glenn jackman, Feb 27 '21 at 18:45

Combine the data of two files having same column using awk

3 Answers3