
I have two data.frames with wonky formatting. One is a large reference; the other is a smaller subset whose entries I would like to look up in the reference to pull additional data, but the formatting makes matching difficult.

The smaller subset looks like this:

> head(lookup, n = 2)

      gene_id class_code nearest_ref_id
1 XLOC_001184       <NA>           <NA>
2 XLOC_001225       <NA>           <NA>


> gene_short_name
1      ORF%20Transcript_11308%7Cg.37058%20Transcript_11308%7Cm.37058%20type%3Acomplete%20len%3A195%20%28%2B%29
2 ORF%20Transcript_11347%7Cg.37236%20Transcript_11347%7Cm.37236%20type%3A5prime_partial%20len%3A87%20%28%2B%29

                    locus length coverage
1 Transcript_11308:0-1727     NA       NA
2 Transcript_11347:0-1584     NA       NA

And the reference looks like the following (note: the sequences were manually truncated so that they are not too long to display here):

> head(refRna, n=2)
             seq_names   sequences
1 Transcript_0 len=550   GTTTTATTTGTTGTTGTTGTTGTTTTTATATGTA
2 Transcript_1 len=760   GACCACACCACTCGTCTGAATTCTCGATGTGGAA

There is a space in the reference$seq_names, a : in the lookup$locus with some extra numbers after it.

Some of the reference$seq_names have extra info with more spaces. For example:

4        Transcript_3 len=440 CDS=1-439 exon=0-440 five_prime_UTR=439-440 gene=0-440 mRNA=0-440 three_prime_UTR=0-1

The Transcript_1234 bit is a unique identifier.

Ultimately I would like to retrieve the reference$sequences for each lookup$locus and append it to a new column, lookup$sequence or create a new data frame with only the XLOC_1234, Transcript_1234 bit and the corresponding sequence. Appreciate any advice.

David C.
user974887

1 Answer


Based on what you provided, here are some strategies for tackling your problem. I am assuming your objects are plain data.frames and not specialized classes such as GenomicRanges (use the function class to double-check).

  • First, you will need to clean up the column reference$seq_names by stripping away the space and everything after it, as you mentioned. Functions such as gsub (base R) and str_replace (from the stringr package) will be helpful. See this StackOverflow post.
  • Similarly, the lookup data.frame needs to be cleaned up. To drop the text and numbers after the :, the function gsub is your friend; you can combine it with a regular expression to drop everything from a symbol of interest onwards. For example:

    lookup$locus <- gsub(":.*", "", lookup$locus) # removes the colon and everything after it (no backslash: "\:" is not a valid escape in an R string)

  • The "look-up" process you mentioned is known as a query. You'll need a key column to identify and match your smaller subset against the reference data set. The key in your reference appears to be the transcript ID, which you will need to extract from a longer string. Based on what you provided, one strategy is to drop everything after the space, and save as a new column. For example:

    reference$NEW_COLUMN <- gsub(" .*", "", reference$seq_names) # removes the first space and everything after it

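  • To achieve your final task, you may want to perform a join operation (see this Wikipedia page for background). In R, we use the function merge(DATAFRAME1, DATAFRAME2, by=KEY). If the key column has a different name in each data.frame (here locus in lookup and NEW_COLUMN in reference), use by.x and by.y instead of by, and add all.x = TRUE if lookup rows without a match in the reference should be kept.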
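
Putting the steps above together, here is a minimal sketch using toy data. The column name transcript_id, the object name result, and the toy values are assumptions made for illustration only; the real lookup and refRna objects have more columns and much longer sequences. It shows both outcomes asked for in the question: appending a sequence column with match, and building a small joined data.frame with merge.

```r
# Toy stand-ins for the two data.frames in the question (values are made up)
lookup <- data.frame(
  gene_id = c("XLOC_001184", "XLOC_001225"),
  locus   = c("Transcript_0:0-1727", "Transcript_1:0-1584"),
  stringsAsFactors = FALSE
)
refRna <- data.frame(
  seq_names = c("Transcript_0 len=550",
                "Transcript_1 len=760",
                "Transcript_3 len=440 CDS=1-439 exon=0-440"),
  sequences = c("GTTTTATTTG", "GACCACACCA", "ACGTACGTAC"),
  stringsAsFactors = FALSE
)

# Key in lookup: drop the colon and everything after it
lookup$transcript_id <- sub(":.*", "", lookup$locus)

# Key in the reference: drop the first space and everything after it
refRna$transcript_id <- sub(" .*", "", refRna$seq_names)

# Option 1: append the sequence as a new column, keeping lookup's row order
# (rows with no match in the reference get NA)
idx <- match(lookup$transcript_id, refRna$transcript_id)
lookup$sequence <- refRna$sequences[idx]

# Option 2: a new data.frame with only the gene ID, transcript ID and sequence
# (all.x = TRUE keeps lookup rows that have no match in the reference)
result <- merge(lookup[, c("gene_id", "transcript_id")],
                refRna[, c("transcript_id", "sequences")],
                by = "transcript_id", all.x = TRUE)
```

Note that merge sorts the result by the key column by default, so use the match approach if the original row order of lookup matters.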

David C.