I have two data.frames with wonky formatting. One is a large reference; the other is a smaller subset whose entries I would like to look up in the reference to pull additional data. The identifiers in the two are formatted differently, which makes matching difficult.
The smaller subset looks like this (the output wraps because the data frame is wide):
> head(lookup, n = 2)
gene_id class_code nearest_ref_id
1 XLOC_001184 <NA> <NA>
2 XLOC_001225 <NA> <NA>
gene_short_name
1 ORF%20Transcript_11308%7Cg.37058%20Transcript_11308%7Cm.37058%20type%3Acomplete%20len%3A195%20%28%2B%29
2 ORF%20Transcript_11347%7Cg.37236%20Transcript_11347%7Cm.37236%20type%3A5prime_partial%20len%3A87%20%28%2B%29
locus length coverage
1 Transcript_11308:0-1727 NA NA
2 Transcript_11347:0-1584 NA NA
And the reference looks like the following (note: parts of the sequences were manually removed so that they are not too long to display here):
> head(refRna, n=2)
seq_names sequences
1 Transcript_0 len=550 GTTTTATTTGTTGTTGTTGTTGTTTTTATATGTA
2 Transcript_1 len=760 GACCACACCACTCGTCTGAATTCTCGATGTGGAA
In reference$seq_names the identifier is followed by a space, while in lookup$locus it is followed by a : and some extra numbers. Some of the reference$seq_names entries have extra info with more spaces. For example:
4 Transcript_3 len=440 CDS=1-439 exon=0-440 five_prime_UTR=439-440 gene=0-440 mRNA=0-440 three_prime_UTR=0-1
The Transcript_1234 bit is a unique identifier.
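To make the formatting difference concrete, here is a minimal sketch of isolating the Transcript_1234 part from each column. It assumes the identifier is always the first thing in the string, followed by a space in seq_names and by a : in locus. The vectors are toy values copied from the examples above, and ref_id and locus_id are illustrative names only.

```r
# Toy values copied from the examples in this question
seq_names <- c("Transcript_0 len=550",
               "Transcript_3 len=440 CDS=1-439 exon=0-440")
locus <- c("Transcript_11308:0-1727", "Transcript_11347:0-1584")

# Drop everything from the first space onward (reference side)
ref_id <- sub(" .*$", "", seq_names)

# Drop everything from the first colon onward (lookup side)
locus_id <- sub(":.*$", "", locus)

print(ref_id)
print(locus_id)
```

Both vectors then hold bare Transcript_ identifiers, which can be compared directly.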
Ultimately I would like to retrieve the reference$sequences value for each lookup$locus and append it as a new column, lookup$sequence, or create a new data frame with only the XLOC_1234 and Transcript_1234 bits and the corresponding sequence. I would appreciate any advice.
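For reference, the result I am after could be sketched roughly as follows. This is only an illustration under assumptions: the toy data frames reuse the column names shown above (with refRna standing in for the reference), the locus values are made up so that they match the two reference rows displayed, and transcript_id is a hypothetical helper column, not something in my real data.

```r
# Toy data shaped like the data frames above; locus values are invented
# so that they match the two reference rows shown in the question
lookup <- data.frame(
  gene_id = c("XLOC_001184", "XLOC_001225"),
  locus   = c("Transcript_1:0-1727", "Transcript_0:0-1584"),
  stringsAsFactors = FALSE
)
refRna <- data.frame(
  seq_names = c("Transcript_0 len=550", "Transcript_1 len=760"),
  sequences = c("GTTTTATTTGTTGTTGTTGTTGTTTTTATATGTA",
                "GACCACACCACTCGTCTGAATTCTCGATGTGGAA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: the bare Transcript_ identifier on each side
# (as.character() guards against the columns being factors)
lookup$transcript_id <- sub(":.*$", "", as.character(lookup$locus))
refRna$transcript_id <- sub(" .*$", "", as.character(refRna$seq_names))

# match() gives, for each lookup ID, the row of refRna with the same ID
# (NA when the ID is not present in the reference)
idx <- match(lookup$transcript_id, refRna$transcript_id)
lookup$sequence <- as.character(refRna$sequences)[idx]

# Compact data frame with only the gene ID, transcript ID and sequence
result <- lookup[, c("gene_id", "transcript_id", "sequence")]
print(result)
```

Using match() keeps the row order of lookup and leaves NA where no reference entry exists; merge() on transcript_id would be an alternative, but by default it sorts the result and drops unmatched rows.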