0

I have two files wherein file1 spans around 6000 lns containing Id matching value and file2 1000lns containing ID. I want to retrieve value from file1 for only those IDs in file2.

file1.tsv:

RtGene0001      rt_000005300
RtGene0002      rt_000005400
RtGene0003      rt_000005500
RtGene0004      rt_000005600
RtGene0005      rt_000005700
RtGene0006      rt_000005800
RtGene0007      rt_000005900
RtGene0008      rt_000006000
RtGene0008      rt_000006100
RtGene0009      rt_000006200

file2.tsv:

RtGene0006
RtGene0007
RtGene0013
RtGene0014

Expected output:

RtGene0006      rt_000005800
RtGene0007      rt_000005900

I have tried few awk scripts but they keep showing in NA in the value column.

scripts:

awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2; next} {print $1, (($1 in a)?a[$1]:"NAN")}' file1.tsv file2.tsv

Output I got:

RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
RtGene00NAN
awk -f filename1.awk file1.tsv file2.tsv 
filename1.awk
BEGIN{FS=OFS="\t"} 
FNR==NR{
  a[$1]=$2
  next
}
{ if ($1 in a) {print $1, a[$1]} else {print $1,"\t", "#NA"} }

Output I got:

RtGene0006              #NA
RtGene0007              #NA

Kindly help me address what I could be missing and give suggestions on how to work around the issue.

nivitian
  • 13
  • 4
  • Your first demo shows `file.tsv` instead of `file2.tsv`. When I run it with `file2.tsv` as the second file I get `RtGene0006 rt_000005800` `RtGene0007 rt_000005900` `RtGene0013 NAN` `RtGene0014 NAN` – tripleee Apr 04 '23 at 08:50
  • Similarly for your second example I get `RtGene0006 rt_000005800` `RtGene0007 rt_000005900` `RtGene0013 #NA` `RtGene0014 #NA` – tripleee Apr 04 '23 at 08:51
  • The `file1.tsv` example you show seems to be space-separated, not tab-separated. With that change, I get results which look more like what you report. Is your problem that your TSV is not actually tab-separated values? – tripleee Apr 04 '23 at 08:53
  • Yes, that was a typo. I have edited it. I thought it would run seamlessly, however when I try to run them I am getting NAN/NA instead of matching values – nivitian Apr 04 '23 at 08:54
  • I did awk -F"\t" '{print $1}' on both files separately and got the first column. I confirmed them to be tab separated. I – nivitian Apr 04 '23 at 08:55
  • Demo: https://ideone.com/nJvRI5 – tripleee Apr 04 '23 at 08:57
  • Your demo was so helpful to confirm my code. I guess it is the problem with my file. The ending of my file1. tsv `RtGene0022^IRtart_000007800$` whereas file2 `RtGene6226^M$` `RtGene6386$`. Not sure how to work around this. – nivitian Apr 04 '23 at 09:21
  • Thanks a lot. I was able to fix it using `tr -d '\015' file_test2.tsv` – nivitian Apr 04 '23 at 10:07

0 Answers0