
Basically I want to combine the power of grep with awk or bash commands. I have two files like this:

$file1
ENSG00000000003 TSPAN6  ensembl_havana  TSPAN6
ENSG00000000419 DPM1    ensembl_havana  DPM1
ENSG00000000457 SCYL3   ensembl_havana  SCYL3
ENSG00000000460 C1orf112    ensembl_havana  C1orf112
ENSG00000000971 CFH ensembl_havana  CFH
ENSG00000001036 FUCA2   ensembl_havana  FUCA2

$file2
ENSG00000000003.12  0.0730716237772557  -0.147970450702234
ENSG00000000419.5   0.156405616866614   -0.0398488625782745
ENSG00000000457.3   -0.110396121325736  -0.0147093758392248
ENSG00000000460.15  -0.0457144601264149 0.322340330477282
ENSG00000000971.12  0.0613967504891434  -0.0198254029339757
ENSG00000001036.4   0.00879628204710496 0.0560438506950908

And here is my desired output:

ENSG00000000003.12  TSPAN6  0.0730716237772557  -0.147970450702234
ENSG00000000419.5   DPM1    0.156405616866614   -0.0398488625782745 
ENSG00000000457.3   SCYL3   -0.110396121325736  -0.0147093758392248 
ENSG00000000460.15  C1orf112    -0.0457144601264149 0.322340330477282   
ENSG00000000971.12  CFH 0.0613967504891434  -0.0198254029339757 
ENSG00000001036.4   FUCA2   0.00879628204710496 0.0560438506950908  

This output would also be useful:

ENSG00000000003 TSPAN6  0.0730716237772557  -0.147970450702234
ENSG00000000419 DPM1    0.156405616866614   -0.0398488625782745 
ENSG00000000457 SCYL3   -0.110396121325736  -0.0147093758392248 
ENSG00000000460 C1orf112    -0.0457144601264149 0.322340330477282   
ENSG00000000971 CFH 0.0613967504891434  -0.0198254029339757 
ENSG00000001036 FUCA2   0.00879628204710496 0.0560438506950908

I have tried the command from "Obtain patterns from a file, compare to a column of another file, print matching lines, using awk":

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){print; break}}} n' file2 file 

But obviously it does not give me the desired output.

Thanks

user2380782

2 Answers


With awk:

awk 'NR == FNR { a[$1] = $2; next } { split($1, b, "."); print $1, a[b[1]], $2, $3 }' file1 file2

This works as follows:

NR == FNR {                  # While processing the first file
  a[$1] = $2                 # just remember the second field by the first
  next
}
{                            # while processing the second file
  split($1, b, ".")          # split first field to isolate the key
  print $1, a[b[1]], $2, $3  # print relevant fields and the remembered
                             # bit from the first file.
}
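For the second desired output (IDs without the version suffix), printing the split-off key `b[1]` in place of `$1` should be enough. A minimal runnable sketch, with the question's sample data shortened (the truncated numbers are an assumption for brevity):

```shell
# Recreate two rows of the question's sample data (numbers truncated)
printf 'ENSG00000000003 TSPAN6 ensembl_havana TSPAN6\nENSG00000000419 DPM1 ensembl_havana DPM1\n' > file1
printf 'ENSG00000000003.12 0.073 -0.147\nENSG00000000419.5 0.156 -0.039\n' > file2

# Same lookup as above, but print the bare key b[1] instead of $1,
# which drops the ".NN" version suffix
awk 'NR == FNR { a[$1] = $2; next } { split($1, b, "."); print b[1], a[b[1]], $2, $3 }' file1 file2
```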
Wintermute
  • nit-pick: the 3rd arg for split() is a regexp so you should get into the habit of using regexp delimiters (`/./`) for it instead of string delimiters (`"."`) since the latter cause awk to have to read the content twice to convert the string to a regexp before using it and that leads to unnecessary complications in some cases when specifying the content. – Ed Morton Mar 05 '15 at 20:17
  • `/./` doesn't work (`/\./` does). gawk and mawk make a string `fs` parameter to `split` behave like `FS`, which is only treated as a regex if it's longer than a single character (as written in [POSIX](http://pubs.opengroup.org/onlinepubs/009696699/utilities/awk.html#tag_04_06_13_04)), but if I read the part of POSIX about the `split` function correctly, that behavior is not strictly conforming. Hmm...you have access to nawk, don't you? Does it behave like gawk and mawk with `"."`? – Wintermute Mar 05 '15 at 20:29
  • Sorry, `/\./` is what I meant. gawk, nawk, and /usr/xpg4/bin/awk will all treat `"."` as a literal `.`. Yes, the 3rd arg for split is a field separator which is a regexp with some additional properties that match how awk splits records into fields using FS (e.g. `FS=" "` will ignore leading and trailing white space and use contiguous white space as the separator). – Ed Morton Mar 05 '15 at 21:21
  • Here's the problem that's solved by using regexp delimiters: let's say you want to split strings wherever `b.d` occurs where the `.` is literal. `split("abcdeb.df",a,"b.d"); print a[2]` will incorrectly print `e`. The obvious step of changing `"b.d"` to `"b\.d"` will still print `e` (and produce a `warning: escape sequence '\.' treated as plain '.'` in gawk). You need to double-escape the `.` since you are using a string in a regexp context, e.g. `"b\\.d"` to get the correct output, `f`. If on the other hand you were using regexp delimiters all along then all you need is the intuitive `/b\.d/`. – Ed Morton Mar 05 '15 at 21:37
  • Well, yes, if you want (or are forced) to use regexes, using explicit regex syntax makes sense, but that's not really the case here. To convert a one-character string into a regex only to escape the character and make it behave like the string again feels kind of circuitous. I understand that you argue for consistency, which is valuable, but since this is consistent with the way we (have to) use `FS`, I don't think using `"."` is a problem in that regard, and the intent is, in my view, clearer this way. Both work, though, so I'll chalk this up as a matter of opinion. – Wintermute Mar 05 '15 at 22:01
  • It's not a big deal either way but IMHO it's not obvious when I see `split(...,".")` that the `.` is going to be treated literally as opposed to `split(...,"a.b")` where it will be treated as a metacharacter unless I double-escape it as `split(...,"a\\.b")`. On the other hand if I see `split(...,/\./)` then I KNOW what that means and it's exactly the same meaning as when it appears in `split(...,/a\.b/)`. Just keeps it nice and simple for me, but YMMV I suppose. – Ed Morton Mar 05 '15 at 22:46
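The escaping behaviour discussed in the comments above can be checked directly. A quick sketch (assuming gawk/mawk semantics as described there):

```shell
# A string separator is converted to a regexp, so "." matches any character:
# "b.d" matches both "bcd" and "b.d" in the input, splitting it three ways
awk 'BEGIN { split("abcdeb.df", a, "b.d"); print a[2] }'    # prints "e"

# Double-escaping inside the string makes the "." literal...
awk 'BEGIN { split("abcdeb.df", a, "b\\.d"); print a[2] }'  # prints "f"

# ...as does using a regexp literal with a single escape
awk 'BEGIN { split("abcdeb.df", a, /b\.d/); print a[2] }'   # prints "f"
```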
$ awk 'NR==FNR{m[$1]=$2;next} {sub(/[[:space:]]/," "m[$1])} 1' file1 FS='.' file2
ENSG00000000003.12 TSPAN6 0.0730716237772557  -0.147970450702234
ENSG00000000419.5 DPM1  0.156405616866614   -0.0398488625782745
ENSG00000000457.3 SCYL3  -0.110396121325736  -0.0147093758392248
ENSG00000000460.15 C1orf112 -0.0457144601264149 0.322340330477282
ENSG00000000971.12 CFH 0.0613967504891434  -0.0198254029339757
ENSG00000001036.4 FUCA2  0.00879628204710496 0.0560438506950908
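A rough reading of how this one works: file1 is read with the default FS, building the ID-to-gene-name map; the `FS='.'` placed between the filenames takes effect only for file2, so `$1` there is the bare ID before the version suffix; `sub()` then splices the remembered gene name in after the versioned ID. Note this relies on file2's columns being separated by more than one space, as in the question's aligned data. An annotated sketch with shortened sample data (the truncated numbers are an assumption):

```shell
# Recreate one row of the question's data; file2 uses two-space separators
printf 'ENSG00000000003 TSPAN6 ensembl_havana TSPAN6\n' > file1
printf 'ENSG00000000003.12  0.073  -0.147\n' > file2

awk '
  NR == FNR { m[$1] = $2; next }     # file1, default FS: map ID -> gene name
  { sub(/[[:space:]]/, " " m[$1]) }  # file2, FS=".": $1 is the bare ID; replace
                                     # the first whitespace char with " <gene name>"
  1                                  # print the (modified) line
' file1 FS='.' file2
```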
Ed Morton