DNA conditional frequency in R

Question

I'm trying to find if there is any conditional dependence within 2 different DNA sequences in R

This is my code, however i'm getting an error;

Error in `[.data.frame`(data, i) : undefined columns selected

I'm not sure where the issue is, if I parentheses the data[i-1]==bases[b2], i just get multiple unexpected}, which is the only thing I can think else to do.

for (b1 in 1:length(bases))
{
   for (b2 in 1:length(bases))
   {
       count = 1
       for (i in 2:length(mydata1))
       {
           if ((mydata1[i]==bases[b1]) & mydata1[i-1]==bases[b2])
           {
               count = count+1
           }
       }
       b3 = c(bases[b1], bases[b2], count)
       print(b3)
   }
}

_I'm expecting essentially a list of certain DNA bases, for example I see it as if the DNA sequence IS conditional upon the previous base then;.

[1] "A" "C" "002"
[1] "A" "C" "005"
[1] "A" "C" "009"

and so on, that can show me any indication as to whether a certain base has any sort of affect upon the identity of the following base, by clearly showing a condition for A to be previous to C.

Ok so essentially the mydata1 (there is also mydata2) are DNA sequences, that is to say a list of "A", "G", "C" and "T", each of which is 10,000 bases long.

As shown here;

I'm tasked with trying to determine if the sequence has bases that are dependent on one another, so if [1] T affects the presence of [2] C, etc. One of the sequences is dependent, the other is not.

It's a lot easier for people to help if you provide a [reproducible minimal example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) *including expected output and sample data*. — Maurits Evers, Nov 27 '17 at 21:05
Can you add some basic on the structure of your data? You use `length(mydata1)`, but never refer to this data again but you use a data structure `data` in the body of the `for` loop. Hard to tell what is causing the problems. — PhillipD, Nov 27 '17 at 21:06
`mydata` and `data` are different variable names. Given your code’s logic, they should almost certainly be the same. — Konrad Rudolph, Nov 27 '17 at 21:12
Apologies, i'm very new to this entire thing. I've updated my initial post with corrections. Not entirely sure this will help more at all, my bad if it doesn't. — daenwaels, Nov 27 '17 at 21:16
Try to show us the `head()` of all of the data you are using-- it doesn't have to be the real data, but it needs to have the same type of values (i.e. character/numeric/etc.), I think what you're doing can be done without all of those for-loops, but no idea what we're working with, here. — Phantom Photon, Nov 27 '17 at 21:24
I've added in the first 10 lines of my data, if that helps you to get how i'm approaching this. — daenwaels, Nov 27 '17 at 21:33
What does `"A" "A" "058"` mean? A followed by A is 58 times? — zx8754, Nov 27 '17 at 21:36
@zx8754 that is saying at position 058, A follows A. And thanks for the tip, I wasn't aware! — daenwaels, Nov 27 '17 at 21:44
Sorry, maybe wasn't clear, if your data was 10 rows, what would be the expected output? — zx8754, Nov 27 '17 at 21:53
@zx8754 Apologies for this, I think possibly I'm looking at this in the complete wrong light. For what I think would be the output I've updated that section in the post. If that was what you're trying to ask for? — daenwaels, Nov 27 '17 at 21:58
If `data` or `mydata` is a `dataframe`, `mydata[i]` returns the ith column not the ith row. If you want to return the ith row of all columns, you need `mydata[i,]`. If you want to return the ith row of the jth column, you do `mydata[i,j]`. — Lamia, Nov 27 '17 at 22:24

score 1 · Accepted Answer · answered Nov 28 '17 at 09:48

If I understand correctly, you want to count the occurrences of each pair of nucleotides i, i+1 in a sequence of DNA. You can achieve this with R function table; an example is provided below.

# input sequence
seq <- "ACGTACTGCACAAACTAC"

# length of input sequence
length_seq <- nchar(seq, type="chars")

# first substring: from 1 to second-last 
seq1 <- substr(seq, 1, (length_seq - 1))

# second substring: from 2 to last
seq2 <- substring(seq, 2, length_seq)

# split strings
seq1_split <- strsplit(seq1, "")[[1]]
seq2_split <- strsplit(seq2, "")[[1]]

# initialize vectors
first_nt <- vector(mode="character", length = (length_seq - 1))
second_nt <- vector(mode="character", length = (length_seq -1))

# fill vectors
count = 0
for (b in seq1_split)
{
    count = count + 1
    first_nt[count] <- b
}

count = 0
for (b in seq2_split)
{
    count = count + 1
    second_nt[count] <- b
}

# create matrix with character i and i+1 in each row
mat <- matrix(c(first_nt, second_nt), nrow=(length_seq - 1))

# collapse matrix
to_table <- apply(mat, 1, paste, collapse="")

# table
my_table <- table(to_table)

print(my_table)

DNA conditional frequency in R

1 Answers1