
I am trying to merge two data frames. The first contains the first 30 nucleotides (characters) of each sequence, repeated once per nucleotide (so 30 rows per sequence). Here is a subset of that data frame:

Data Frame 1

The second data frame has each full ORF sequence once, with associated Prot. Molecules per cell scores for each sequence. I want to match each 30nt sequence (and all its repeats) from the first data frame with the Prot. Molecules per cell counts from the second data frame. Here is a subset of the second data frame:

Data Frame 2

My general idea was to replace each sequence in the second data frame with only its first 30 nucleotides and then use the merge() function. However, I don't know how to slice the sequences, and I am also worried that merge() in R will drop the repeats of each 30-nucleotide sequence in the first data frame.

Would greatly appreciate any help!

Paulo Mattos
Will Barr
  • Please give us a minimal, reproducible example using `dput`. – tyluRp Dec 01 '17 at 16:54
  • Use `substr` to take the first 30 characters. And don't worry about the merge function. If you try it out, you will find your worries are unfounded. – Gregor Thomas Dec 01 '17 at 16:54
  • If you want a solution with code, you should provide code as tyluRp says. Your images of data look nice but don't copy/paste well; `dput(droplevels(head(your_data)))` looks bad but copy/pastes beautifully. [See more tips here on making reproducible examples in R](https://stackoverflow.com/q/5963269/903061). – Gregor Thomas Dec 01 '17 at 16:57
  • if you're able to join by trimming the string, you'll also be able to join by: 1) sort the strings 2) assign an ID to each according to that order (so A gets 1, AB gets 2, B gets 3, ...) 3) join by that ID. data.table's .GRP object is specifically designed for similar use cases, please check it out. – MichaelChirico Dec 01 '17 at 17:15
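
The ID-based join described in the last comment can be sketched roughly as follows. This uses base R rather than data.table's `.GRP`, and the data frames `x` and `y`, the columns `key` and `score`, and all values are hypothetical toy data, not the original poster's:

```r
# Hypothetical toy keys; in the question these would be the 30-nt sequences
x <- data.frame(key = c('B', 'A', 'AB', 'A'), stringsAsFactors = FALSE)
y <- data.frame(key = c('A', 'AB', 'B'), score = c(10, 20, 30),
                stringsAsFactors = FALSE)

# 1) sort the distinct keys seen in either table
lv <- sort(unique(c(x$key, y$key)))

# 2) give each key an integer ID equal to its position in that order
x$id <- match(x$key, lv)
y$id <- match(y$key, lv)

# 3) join on the integer ID instead of the string
joined <- merge(x, y[, c('id', 'score')], by = 'id')
joined
```

The IDs are built from the keys of both tables together, so the same string always maps to the same ID on each side of the join.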

1 Answer

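# subset string: keep the first 30 characters of each sequence

a = 'CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC'
b = 'CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG'

# toy data: one row per sequence, in a single column named 'seq'
df = setNames(data.frame(rbind(a, b)), 'seq')
# substr(x, start, stop) returns characters 1 through 30 of each string
df$char_30 = substr(df$seq, 1, 30)
head(df)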
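
To connect this back to the merge part of the question, here is a minimal sketch of the slice-then-merge step. The data frames `df1` and `df2` and the columns `seq_30`, `position`, `orf_seq`, and `prot_per_cell` are hypothetical stand-ins, since the original data was posted as images, and the counts are made up:

```r
# Toy stand-in for data frame 1: each 30-nt sequence repeated once per position
# (only 3 positions here instead of 30). All names are hypothetical.
a <- 'CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC'
b <- 'CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG'

df1 <- data.frame(
  seq_30   = rep(substr(c(a, b), 1, 30), each = 3),
  position = rep(1:3, times = 2),
  stringsAsFactors = FALSE
)

# Toy stand-in for data frame 2: one row per full ORF, with a made-up count
df2 <- data.frame(
  orf_seq       = c(a, b),
  prot_per_cell = c(1200, 350),
  stringsAsFactors = FALSE
)

# Slice each full ORF down to its first 30 characters to build the join key
df2$seq_30 <- substr(df2$orf_seq, 1, 30)

# merge() matches every row of df1 against df2, so the repeats are kept
merged <- merge(df1, df2[, c('seq_30', 'prot_per_cell')], by = 'seq_30')
nrow(merged)
```

Every repeated row in `df1` picks up the count from its matching ORF. One caveat: if two different ORFs in `df2` share the same first 30 nucleotides, `merge()` will return one row per matching pair, so it may be worth checking `anyDuplicated(df2$seq_30)` first.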
  • @KickButtowski what link are we talking about? i provided my own toy data since the op put images up. what link are you talking about? – bringtheheat Dec 03 '17 at 17:35