42

File 1 has 5 fields A B C D E, with field A is an integer-valued

File 2 has 3 fields A F G

The number of rows in File 1 is much bigger than that of File 2 (20^6 to 5000)

All the entries of A in File 1 appeared in field A in File 2

I like to merge the two files by field A and carry F and G

Desired output is A B C D E F G

Example

File 1

 A     B     C    D    E
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1

File 2

A     F    G   
4050 12.1 23.6
4049 14.4 47.8   
4048 23.2 43.9
4047 45.5 21.6

Desired output

A    B      C      D   E F    G
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
Dropout
  • 13,653
  • 10
  • 56
  • 109
Tony
  • 2,889
  • 8
  • 41
  • 45
  • 1
    did you leave out the row starting with `4046` in File 2 on purpose or was that an accidental omission? It's important because your desired output does not show the `4046` row yet earlier you say all File 1 rows are accounted for in File 2. – SiegeX Mar 29 '11 at 06:59
  • @SiegeX. Sorry, it was an accidental omission. – Tony Mar 29 '11 at 22:55
  • Out of the triples, I think this example is the best. – Kemin Zhou May 10 '18 at 18:07

4 Answers4

49
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1

Explanation: (Partly based on another question. A bit late though.)

FNR refers to the record number (typically the line number) in the current file and NR refers to the total record number. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands} means that the commands inside the brackets only executed while processing the first file (file2 now).

FS refers to the field separator and $1, $2 etc. are the 1st, 2nd etc. fields in a line. a[$1]=$2 FS $3 means that a dictionary(/array) (named a) is filled with $1 key and $2 FS $3 value.

; separates the commands

next means that any other commands are ignored for the current line. (The processing continues on the next line.)

$0 is the whole line

{print $0, a[$1]} simply prints out the whole line and the value of a[$1] (if $1 is in the dictionary, otherwise only $0 is printed). Now it is only executed for the 2nd file (file1 now), because of FNR==NR{...;next}.

fanyul
  • 148
  • 6
kurumi
  • 25,121
  • 5
  • 44
  • 52
  • 3
    @kurumi There was no merging, only file 1 was printed – Tony Mar 29 '11 at 04:29
  • @tony, no, i tested it. see my edit – kurumi Mar 29 '11 at 04:33
  • @kurumi I came up with the nearly the same answer before seeing yours but it's better to upvote those that got there first than to duplicate. So before I delete mine and upvote you, you should edit yours to add the `(a[$1])` portion as in my answer. – SiegeX Mar 29 '11 at 05:29
  • @Siegex, if you look at OP's question: He says, `All the entries of A in File 1 appeared in field A in File 2`. I am suspecting he wants those that are also not matched. Until further clarification, I will not edit my answer as yet. But thanks for the heads up as well. – kurumi Mar 29 '11 at 05:44
  • @kurumi Sorry The code works well now -Cheers – Tony Mar 29 '11 at 06:08
  • Apparently the check for `a[$1]` doesn't matter. +1 and deleted – SiegeX Mar 29 '11 at 17:35
  • 1
    can you elaborate on `a[$1]=$2` ? how did u get $2? – jxn Jun 28 '15 at 04:22
  • 2
    bit of explanation wouldn't hurt. – Raymond gsh Jun 14 '17 at 14:24
32

Thankfully, you don't need to write this at all. Unix has a join command to do this for you.

join -1 1 -2 1 File1 File2

Here it is "in action":

will-hartungs-computer:tmp will$ cat f1
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
will-hartungs-computer:tmp will$ cat f2
4050 12.1 23.6
4049 14.4 47.8   
4048 23.2 43.9
4047 45.5 21.6
will-hartungs-computer:tmp will$ join -1 1 -2 1 f1 f2
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
will-hartungs-computer:tmp will$ 
Will Hartung
  • 115,893
  • 19
  • 128
  • 203
  • 4
    Using `join` requires the files in sorted order; the inputs shown are not in sorted order (or, not ascending sorted order). And it is probably an exercise in using `awk` so quite possibly your otherwise sensible (once corrected) solution is not acceptable. – Jonathan Leffler Mar 29 '11 at 04:04
  • @Will. Wow. Does join command require the same number of rows for both files? Thanks – Tony Mar 29 '11 at 04:07
  • @Jonathan Well, at least on my Mac, the join command did exactly what he wanted, and the files were not sorted, I simply cut/pasted them in to vi. – Will Hartung Mar 29 '11 at 04:21
  • @Tony No, the files don't (apparently) have to have the same number of rows. – Will Hartung Mar 29 '11 at 04:21
  • @Tony: no - the files can have different numbers of rows, but they do both have to be sorted in (the same) sorted order. The `join` command is powerful - it can handle left and right and full outer joins, for example - but has a syntax that is obscure to the point of obfuscation. – Jonathan Leffler Mar 29 '11 at 04:23
  • The POSIX standard says "The files file1 and file2 shall be ordered in the collating sequence of sort -b on the fields on which they shall be joined, ...". There's a decent chance that if the sets are dense and in reverse order that it will work - your experiment suggests that (I can confirm/reproduce that result). Stray outside that and you're likely to run into trouble. (Try deleting the two 4048 lines from File 1, for example.) – Jonathan Leffler Mar 29 '11 at 04:32
  • 1
    Yup, seems to have problems when they get out of order. It seems to not like items in the second file that are not in the first. Of course, if order is not important, there's sort(1). – Will Hartung Mar 29 '11 at 04:40
  • Where has this command been all my life? God I love SO. – shanusmagnus Aug 31 '14 at 20:55
6

You need to read the entries from File 2 into a pair of associative arrays in the BEGIN block. Assuming GNU Awk:

BEGIN { while (getline < "File 2") { f[$1] = $2; g[$1] = $3 } }

In the main processing block, you read the line from File 1 and print it with the correct data from the arrays created in the BEGIN block:

{ print $0, f[$1], g[$1] }

Supply File 1 as the filename argument to the program.

awk 'BEGIN { while (getline < "File 2") { f[$1] = $2; g[$1] = $3 } }
     print $0, f[$1], g[$1] }' "File 1"

The quotes around the file name argument are needed because of the spaces in the file name. You need the quotes around the getline filename even if it contained no spaces as it would otherwise be a variable name.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
-2
awk 'BEGIN{OFS=","}  FNR==NR {F[$1]=$2;G[$1]=$3;next} {print $1,$2,$3,$4,$5,F[$1],G[$1]}' file2.txt file1.txt
Tunaki
  • 132,869
  • 46
  • 340
  • 423
NAGAPPA
  • 1
  • 2
  • 4
    This answer turned up in the low quality review queue, presumably because you don't provide any explanation of the code. If this code answers the question, consider adding adding some text explaining the code in your answer. This way, you are far more likely to get more upvotes — and help the questioner learn something new. – lmo Aug 28 '16 at 15:02
  • 1
    Could you please provide a little explanation? Really appreciate. – Praveen Mishra Dec 04 '19 at 12:05