I have two files; the first looks like this,
SomeFile.CEL SomeOtherFile.CEL
probe1 111 666
probe2 222 777
probe3 333 888
probe4 444 999
probe5 555 100
probe6 101 102
and the second looks like this (note: the duplicate probe4, has two different gene names),
probe1 Gene1
probe2 Gene2
probe3 Gene3
probe4 Gene4A
probe4 Gene4B
probe5 Gene5
probe7 Gene6
What I need is an output file to look like this,
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
This ideal output file would contain all of the gene names which had matched probe names between the two files. Additionally, where multiple names exist for a single probe, I want the expression data (444 999) to be duplicated for all possible gene names (this example shows 2 gene names for a single probe, but it could be as many as 5 or 6.) By the way all files are tab-separated.
I've searched through this and other forums and while these article came close,
Replace multiple arguments with gsub
awk print column $3 if $2==a specific value?
awk partly string match (if column partly matches)
Sed pattern to match and replace
they don't answer my full question.
So far, I have had the most success with this command,
awk -F"\t" 'FILENAME=="input1.file"{a[$1]=$1} FILENAME=="input2.file {if(a[$1]){$1="";print $0}}' input1.file input2.file
but it doesn't account for the necessary duplication. Lastly, there are some files which look like input1, but contain more than just the two samples I described (someFile.CEL and someOtherFile.CEL). There could be as many as 50 samples/CEL files. I figure I might have to build a script, but I thought I'd ask if there was a simpler way first.