How to replace all elements of a column with the first 30 characters of each element in that column (in R)?

Question

I am trying to merge two data frames. The first holds the first 30 nucleotides (characters) of each sequence, repeated once per nucleotide (so 30 rows per sequence). Here is a subset of that data frame:

Data Frame 1

The second data frame has each full ORF sequence once, with associated Prot. Molecules per cell scores for each sequence. I want to match each 30nt sequence (and all its repeats) from the first data frame with the Prot. Molecules per cell counts from the second data frame. Here is a subset of the second data frame:

Data Frame 2

My general thoughts were to find a way to replace each sequence in the second data frame with only the first 30 nucleotides in that sequence and then use the merge() function. However, I am afraid I don't know how to slice the sequences, and I am also worried that the merge() function in R will remove the repeats of each 30 nucleotide sequence in the first data frame.

Would greatly appreciate any help!

Please give us a minimal, reproducible example using `dput`. — tyluRp, Dec 01 '17 at 16:54
Use `substr` to take the first 30 characters. And don't worry about the merge function. If you try it out, you will find your worries are unfounded. — Gregor Thomas, Dec 01 '17 at 16:54
If you want a solution with code, you should provide code as tyluRp says. Your images of data look nice but don't copy/paste well; `dput(droplevels(head(your_data)))` looks bad but copy/pastes beautifully. [See more tips here on making reproducible examples in R](https://stackoverflow.com/q/5963269/903061). — Gregor Thomas, Dec 01 '17 at 16:57
if you're able to join by trimming the string, you'll also be able to join by: 1) sort the strings 2) assign an ID to each according to that order (so A gets 1, AB gets 2, B gets 3, ...) 3) join by that ID. data.table's .GRP object is specifically designed for similar use cases, please check it out. — MichaelChirico, Dec 01 '17 at 17:15
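
The ID-based join suggested in the last comment can be sketched in base R as follows. This is a stand-in for data.table's `.GRP`, not that feature itself, and the toy data and column names (`seq30`, `position`, `molecules`) are made up for illustration.

```r
# Hypothetical toy data; column names are assumptions, not the asker's.
reads <- data.frame(
  seq30 = c("AAC", "AAC", "GGT", "GGT"),
  position = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)
orfs <- data.frame(
  seq30 = c("GGT", "AAC"),
  molecules = c(500, 120),
  stringsAsFactors = FALSE
)

# Build one sorted lookup of every distinct key so both frames share IDs.
keys <- sort(unique(c(reads$seq30, orfs$seq30)))
reads$id <- match(reads$seq30, keys)
orfs$id <- match(orfs$seq30, keys)

# Join on the integer ID instead of the string itself.
joined <- merge(reads, orfs[, c("id", "molecules")], by = "id")
```

The IDs only line up because both data frames are matched against the same `keys` vector; numbering each frame separately would not guarantee that.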

score 0 · Answer 1 · answered Dec 01 '17 at 16:57

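    # subset string

    # Toy sequences standing in for the full ORF sequences
    a = 'CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC'
    b = 'CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG'

    # One row per sequence, in a single column named 'seq'
    df = setNames(data.frame(rbind(a, b)), 'seq')
    # Keep only the first 30 characters of each sequence
    df$char_30 = substr(df$seq, 1, 30)
    head(df)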
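
For the merge step itself, here is a minimal sketch under stated assumptions: the data are toy values, the column names (`seq30`, `position`, `orf_seq`, `molecules`) are hypothetical rather than the asker's, and the key is shortened to 3 characters instead of 30 for brevity. It suggests that `merge()` keeps every repeated row from the first data frame as long as each key appears only once in the second.

```r
# Toy stand-in for data frame 1: each key repeated once per position.
df1 <- data.frame(
  seq30 = rep(c("AAC", "GGT"), each = 3),
  position = rep(1:3, times = 2),
  stringsAsFactors = FALSE
)

# Toy stand-in for data frame 2: one full sequence per row with its score.
df2 <- data.frame(
  orf_seq = c("AACTTTT", "GGTCCCC"),
  molecules = c(120, 500),
  stringsAsFactors = FALSE
)

# Truncate the full sequence to the key length (3 here; 30 in the question).
df2$seq30 <- substr(df2$orf_seq, 1, 3)

# Inner join on the truncated key; the repeats in df1 are preserved.
merged <- merge(df1, df2[, c("seq30", "molecules")], by = "seq30")
nrow(merged)  # 6, the same as nrow(df1)
```

If two ORFs share the same first 30 nucleotides, the truncated key is no longer unique in the second data frame and `merge()` returns one row per matching pair, so it may be worth checking `anyDuplicated(df2$seq30)` first. Passing `all.x = TRUE` keeps rows of the first data frame that have no match.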

bringtheheat


@KickButtowski what link are we talking about? i provided my own toy data since the op put images up. what link are you talking about? – bringtheheat Dec 03 '17 at 17:35