2

I'm trying to find if there is any conditional dependence within 2 different DNA sequences in R

This is my code, however i'm getting an error;

Error in `[.data.frame`(data, i) : undefined columns selected

I'm not sure where the issue is, if I parentheses the data[i-1]==bases[b2], i just get multiple unexpected}, which is the only thing I can think else to do.

for (b1 in 1:length(bases))
{
   for (b2 in 1:length(bases))
   {
       count = 1
       for (i in 2:length(mydata1))
       {
           if ((mydata1[i]==bases[b1]) & mydata1[i-1]==bases[b2])
           {
               count = count+1
           }
       }
       b3 = c(bases[b1], bases[b2], count)
       print(b3)
   }
}

_I'm expecting essentially a list of certain DNA bases, for example I see it as if the DNA sequence IS conditional upon the previous base then;.

[1] "A" "C" "002"
[1] "A" "C" "005"
[1] "A" "C" "009"

and so on, that can show me any indication as to whether a certain base has any sort of affect upon the identity of the following base, by clearly showing a condition for A to be previous to C.

Ok so essentially the mydata1 (there is also mydata2) are DNA sequences, that is to say a list of "A", "G", "C" and "T", each of which is 10,000 bases long.

As shown here;

  V1
1  T
2  C
3  G
4  G
5  T
6  G
7  G 
8  G
9  C
10 A

I'm tasked with trying to determine if the sequence has bases that are dependent on one another, so if [1] T affects the presence of [2] C, etc. One of the sequences is dependent, the other is not.

daenwaels
  • 85
  • 1
  • 7
  • 1
    It's a lot easier for people to help if you provide a [reproducible minimal example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) *including expected output and sample data*. – Maurits Evers Nov 27 '17 at 21:05
  • Can you add some basic on the structure of your data? You use `length(mydata1)`, but never refer to this data again but you use a data structure `data` in the body of the `for` loop. Hard to tell what is causing the problems. – PhillipD Nov 27 '17 at 21:06
  • `mydata` and `data` are different variable names. Given your code’s logic, they should almost certainly be the same. – Konrad Rudolph Nov 27 '17 at 21:12
  • Apologies, i'm very new to this entire thing. I've updated my initial post with corrections. Not entirely sure this will help more at all, my bad if it doesn't. – daenwaels Nov 27 '17 at 21:16
  • Try to show us the `head()` of all of the data you are using-- it doesn't have to be the real data, but it needs to have the same type of values (i.e. character/numeric/etc.), I think what you're doing can be done without all of those for-loops, but no idea what we're working with, here. – Phantom Photon Nov 27 '17 at 21:24
  • I've added in the first 10 lines of my data, if that helps you to get how i'm approaching this. – daenwaels Nov 27 '17 at 21:33
  • What does `"A" "A" "058"` mean? A followed by A is 58 times? – zx8754 Nov 27 '17 at 21:36
  • @zx8754 that is saying at position 058, A follows A. And thanks for the tip, I wasn't aware! – daenwaels Nov 27 '17 at 21:44
  • Sorry, maybe wasn't clear, if your data was 10 rows, what would be the expected output? – zx8754 Nov 27 '17 at 21:53
  • 1
    @zx8754 Apologies for this, I think possibly I'm looking at this in the complete wrong light. For what I think would be the output I've updated that section in the post. If that was what you're trying to ask for? – daenwaels Nov 27 '17 at 21:58
  • If `data` or `mydata` is a `dataframe`, `mydata[i]` returns the ith column not the ith row. If you want to return the ith row of all columns, you need `mydata[i,]`. If you want to return the ith row of the jth column, you do `mydata[i,j]`. – Lamia Nov 27 '17 at 22:24

1 Answers1

1

If I understand correctly, you want to count the occurrences of each pair of nucleotides i, i+1 in a sequence of DNA. You can achieve this with R function table; an example is provided below.

# input sequence
seq <- "ACGTACTGCACAAACTAC"

# length of input sequence
length_seq <- nchar(seq, type="chars")

# first substring: from 1 to second-last 
seq1 <- substr(seq, 1, (length_seq - 1))

# second substring: from 2 to last
seq2 <- substring(seq, 2, length_seq)

# split strings
seq1_split <- strsplit(seq1, "")[[1]]
seq2_split <- strsplit(seq2, "")[[1]]

# initialize vectors
first_nt <- vector(mode="character", length = (length_seq - 1))
second_nt <- vector(mode="character", length = (length_seq -1))

# fill vectors
count = 0
for (b in seq1_split)
{
    count = count + 1
    first_nt[count] <- b
}

count = 0
for (b in seq2_split)
{
    count = count + 1
    second_nt[count] <- b
}

# create matrix with character i and i+1 in each row
mat <- matrix(c(first_nt, second_nt), nrow=(length_seq - 1))

# collapse matrix
to_table <- apply(mat, 1, paste, collapse="")

# table
my_table <- table(to_table)

print(my_table)