
I have two files; the first looks like this,

        SomeFile.CEL    SomeOtherFile.CEL
probe1  111     666
probe2  222     777
probe3  333     888
probe4  444     999
probe5  555     100
probe6  101     102

and the second looks like this (note that the duplicated probe4 has two different gene names),

probe1  Gene1
probe2  Gene2
probe3  Gene3
probe4  Gene4A
probe4  Gene4B
probe5  Gene5
probe7  Gene6

What I need is an output file to look like this,

Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100

This ideal output file would contain all of the gene names whose probe names matched between the two files. Additionally, where multiple gene names exist for a single probe, I want the expression data (444 999) duplicated for every gene name (this example shows 2 gene names for a single probe, but there could be as many as 5 or 6). By the way, all files are tab-separated.

I've searched through this and other forums, and while these articles came close,

Replace multiple arguments with gsub

awk print column $3 if $2==a specific value?

awk partly string match (if column partly matches)

Sed pattern to match and replace

Regex match and replace

they don't answer my full question.

So far, I have had the most success with this command,

awk -F"\t" 'FILENAME=="input1.file"{a[$1]=$1} FILENAME=="input2.file"{if(a[$1]){$1="";print $0}}' input1.file input2.file

but it doesn't account for the necessary duplication. Lastly, there are some files which look like input1, but contain more than just the two samples I described (someFile.CEL and someOtherFile.CEL). There could be as many as 50 samples/CEL files. I figure I might have to build a script, but I thought I'd ask if there was a simpler way first.
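For reference, here's the direction I've been heading — a sketch only, tried against nothing but the sample data above, with the probe column stripped off before caching so the number of sample columns shouldn't matter:

```shell
# Recreate the two sample inputs from above (tab-separated).
printf 'probe1\t111\t666\nprobe2\t222\t777\nprobe3\t333\t888\nprobe4\t444\t999\nprobe5\t555\t100\nprobe6\t101\t102\n' > input1.file
printf 'probe1\tGene1\nprobe2\tGene2\nprobe3\tGene3\nprobe4\tGene4A\nprobe4\tGene4B\nprobe5\tGene5\nprobe7\tGene6\n' > input2.file

# First pass caches each expression row keyed on its probe name; second
# pass prints one line per gene entry, so a duplicated probe (probe4) is
# emitted once per gene name, whatever the number of sample columns.
awk -F'\t' -v OFS='\t' '
  NR==FNR { probe = $1; sub(/^[^\t]+\t/, ""); a[probe] = $0; next }
  $1 in a { print $2, a[$1] }
' input1.file input2.file
```

On the sample data this prints the six Gene lines shown above, with probe4's values repeated for Gene4A and Gene4B.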

Greyson B

3 Answers

$ awk 'NR==FNR{a[$1]=$2 FS $3; next} $1 in a{print $2, a[$1]}' file1 file2
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
Ed Morton
  • This works great. Except, what do I do if I have more than 2 columns of data in the first inputFile? The first column is always a list like I've shown, but there could be as many as 50 columns of data following the first. – Greyson B May 02 '16 at 19:28
  • 1
    It's important when you ask a question to include an example that truly represents your real data. There's little worse than trying to help someone and then hearing "great but my data's not REALLY like that, it's actually like this....". Edit your question to include some more truly representative input/output so we're not peeling an onion. – Ed Morton May 02 '16 at 19:31
  • Look, some of my data actually looks like this, some is more detailed (additional columns) but the same format. I'm merely asking a followup. Thank you for your help – Greyson B May 02 '16 at 19:34
  • 1
    And is there any particular reason for you to not include the `some is more detailed` data in your question so we can test a potential solution against it rather than posting something untested and then if/when it fails having to re-work it again? – Ed Morton May 02 '16 at 19:40
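Regarding the follow-up in the comments about files with more data columns: one way to extend the same idea — shown here as a sketch tried only against a hypothetical wider sample — is to build the cached value from every field after the first instead of hard-coding `$2 FS $3`:

```shell
# Hypothetical wider input: four data columns instead of two.
printf 'probe1\t111\t666\t10\t20\nprobe4\t444\t999\t30\t40\n' > file1
printf 'probe4\tGene4A\nprobe4\tGene4B\n' > file2

# Same NR==FNR pattern, but the loop keeps every column after the
# probe name, so file1 can carry 2 or 50 sample columns unchanged.
awk -F'\t' -v OFS='\t' '
  NR==FNR { v = $2; for (i = 3; i <= NF; i++) v = v OFS $i; a[$1] = v; next }
  $1 in a { print $2, a[$1] }
' file1 file2
```

On this sample it prints one line per gene name for probe4, each carrying all four data columns.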

There is a little-known Unix tool for joining files on a (sorted) common column, called join. You can use it in your case like this:

join <( sort file2.txt) <(sort file1.txt ) | cut -d\  -f2-
  • the sorts are required for unsorted files
  • the cut strips away the first column (the probe names)
  • because of the sorting and cutting, awk is probably faster
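For illustration, here is a self-contained run of this pipeline on a cut-down version of the question's data (file names as in the answer; the process substitution requires bash or a similar shell):

```shell
# Cut-down sample data, tab-separated as in the question.
printf 'probe1\t111\t666\nprobe4\t444\t999\n' > file1.txt
printf 'probe4\tGene4A\nprobe4\tGene4B\n' > file2.txt

# join pairs lines on the first column of each sorted file (a duplicated
# probe in file2.txt produces one output line per gene name); cut then
# drops the probe column, leaving gene name plus expression values.
join <(sort file2.txt) <(sort file1.txt) | cut -d' ' -f2-
```

Note that join's output is space-separated by default even when the inputs use tabs.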
Lars Fischer

The GNU join command was made for this exact situation, and it can be combined with awk.

This one-liner works with any number of columns (FIELDS) in the first file:

 join SomeFile.CEL SomeOtherFile.CEL | awk '{$1=$NF; $NF=""; print}'

By default the first FIELD of both files is used for the JOIN.

The 2 files must be sorted on the join fields.

A test with 2 additional sorts to ensure that the JOIN fields are sorted:

$ join <(sort SomeFile.CEL) <(sort SomeOtherFile.CEL) | awk '{$1=$NF; $NF=""; print}'
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100

A second test with another first file that has more columns:

$ cat SomeFile_ManyColumns.CEL
probe1  111     666 666    111 777 888 999
probe2  222     777 111    666 999 888 777
probe3  333     888 101    102 999 888 111
probe4  444     999 876    543 321 678 101
probe5  555     100 101    543 321 666 999
probe6  101     102 888    321 543 101 678

$ join <(sort SomeFile_ManyColumns.CEL) <(sort SomeOtherFile.CEL) | awk '{$1=$NF; $NF=""; print}'
Gene1 111 666 666 111 777 888 999
Gene2 222 777 111 666 999 888 777
Gene3 333 888 101 102 999 888 111
Gene4A 444 999 876 543 321 678 101
Gene4B 444 999 876 543 321 678 101
Gene5 555 100 101 543 321 666 999

----

For the record, a solution with a fixed number of columns (FIELDS):

join -o 2.2,1.2,1.3 SomeFile.CEL SomeOtherFile.CEL

-o 2.2,1.2,1.3 specifies the output FORMAT: one or more comma- or blank-separated specifications, each of the form `FILENUM.FIELD`

The test:

$ join -o 2.2,1.2,1.3 SomeFile.CEL SomeOtherFile.CEL
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
Jay jargot
  • @GreysonB The first example at the top works with any number of columns in the first file `SomeFile.CEL` – Jay jargot May 02 '16 at 20:07
  • So if I have more than three columns in input1.file, could I use something like, `join -o -a1 2.2,0 input1.file input2.file` ? – Greyson B May 02 '16 at 20:09
  • @GreysonB use directly `join <(sort SomeFile.CEL) <(sort SomeOtherFile.CEL) | awk '{$1=$NF; $NF=""; print}'` – Jay jargot May 02 '16 at 20:43
  • @GreysonB a second test has been added to illustrate that the same one-liner command generates good output with first files that have different numbers of columns. – Jay jargot May 02 '16 at 21:39