
I am reading a CSV file into R that looks like this:

3,3
3,2
3,3
3,3
3,3
3,3
2,3
1,2
2,2
3,3

I want to assign a number to each of the 9 unique combinations my data can take (3 and 3 is 9, 3 and 2 is 8, 2 and 3 is 6, etc.). I have been trying to design a nested if statement that evaluates each row and assigns a number in a third column, for every row in the data set. I believe this can be done with the apply function, but I am having trouble getting the if statement to work inside apply. Both columns have possible values of 1, 2, or 3. This is my code thus far, which just tries to assign 9 to the 3/3 rows and 0 to everything else:

#RScript for haplotype analysis

#remove(list=ls())
options(stringsAsFactors=FALSE)
setwd("C:/Documents and Settings/ColumbiaPC/Desktop")

#read in comma-delimited, ID-matched genotype data
OXT <- read.csv("OXTRhaplotype.csv")
colnames(OXT)<- c("OXT1","OXT2")

OXT$HAP <- apply(OXT, 1, function(x) if(x[1]=="3"&&x[2]=="3")x[3]=="9" else 0))

Thanks for any help in advance.

Joshua Ulrich
Bill

4 Answers

You can solve the problem you describe with a lookup matrix and standard R subsetting, without any if statements:

m <- matrix(1:9, nrow=3, byrow=TRUE)
m

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

This means you can index m using matrix subsetting:

m[3, 2]
[1] 8

m[3,3]
[1] 9

m[2,3]
[1] 6
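R also accepts a two-column matrix as the index, in which case each row of that matrix is read as one (row, column) pair. A minimal sketch of that behaviour, where `idx` and `picked` are names made up for this illustration:

```r
m <- matrix(1:9, nrow = 3, byrow = TRUE)

# Each row of idx is one (row, column) pair: (3,2), (3,3), (2,3)
idx <- cbind(c(3, 3, 2), c(2, 3, 3))

# One element of m is returned per row of idx
picked <- m[idx]
picked
```

This is the mechanism that lets a whole data set be looked up in one step, provided it is first converted to a two-column matrix.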

And now you can apply this to your data:

df <- structure(list(V1 = c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 2L, 3L), 
        V2 = c(3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L)), .Names = c("V1", 
        "V2"), class = "data.frame", row.names = c(NA, -10L))

#df$m <- sapply(seq_len(nrow(df)), function(i)m[df$V1[i], df$V2[i]])
df$m <- m[as.matrix(df)]  # Use matrix subsetting, suggested by @Aaron
df

   V1 V2 m
1   3  3 9
2   3  2 8
3   3  3 9
4   3  3 9
5   3  3 9
6   3  3 9
7   2  3 6
8   1  2 2
9   2  2 5
10  3  3 9
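Because the lookup matrix is simply 1:9 filled row by row, the same codes can also be obtained with plain arithmetic. This is only a sketch that relies on the 3-by-3, row-wise layout used above; `code` is a name introduced here for illustration:

```r
# Same example data as above
df <- data.frame(V1 = c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 2L, 3L),
                 V2 = c(3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L))
m <- matrix(1:9, nrow = 3, byrow = TRUE)

# Row-wise fill means the entry at (i, j) equals (i - 1) * 3 + j
code <- (df$V1 - 1L) * 3L + df$V2

# Agrees with the matrix lookup for every row
all(code == m[as.matrix(df)])
```

The matrix version is easier to adapt if the codes ever stop following this regular pattern, since only the contents of m would need to change.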
Andrie

Unfortunately, I came late, and with a solution similar to @Andrie's:

dat <- matrix(c(3,3,3,2,3,3,3,3,3,3,3,3,2,3,1,2,2,2,3,3), 
              nrow=10, byrow=TRUE) 
# here is our lookup table for genotypes
pat <- matrix(1:9, nrow=3, byrow=TRUE, dimnames=list(1:3,1:3))

Then

> pat[dat]
 [1] 9 8 9 9 9 9 6 2 5 9

gives you what you want.

However, I would like to add that you might find it easier to use a dedicated package for genetic studies, such as those on CRAN (genetics, gap, or SNPassoc, to name a few) or Bioconductor, because they include facilities for transforming and recoding genotype data and for working with haplotypes.

Here is an example of what I have in mind with the above remark:

> library(genetics)
> geno1 <- as.genotype.allele.count(dat[,1]-1)
> geno2 <- as.genotype.allele.count(dat[,2]-1)
> table(geno1, geno2)
     geno2
geno1 A/A A/B
  A/A   6   1
  A/B   1   1
  B/B   0   1
chl

Andrie's already answered your question by showing a better approach to your problem. But there are a few mistakes in your original code that I want to mention.

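First, & is not the same as &&. & compares element by element across whole vectors, while && evaluates a single TRUE/FALSE condition and short-circuits. See ?'&' for more. Inside if(), where each side is a single value, either one works; & is the one you need as soon as you compare whole columns at once.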
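A small sketch of the difference, using made-up vectors `a` and `b` that stand in for two genotype columns (they are not from the question's data):

```r
# Hypothetical vectors standing in for two genotype columns
a <- c(3, 3, 2)
b <- c(3, 2, 3)

# & works element by element and returns one result per position
both3 <- a == 3 & b == 3
both3

# && is meant for a single condition, such as one row at a time
first_is_33 <- a[1] == 3 && b[1] == 3
first_is_33
```

Passing whole vectors to && is best avoided: as far as I know, older versions of R silently looked at the first element only, and recent versions raise an error.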

Second, == tests equality, which you use correctly in the condition of your example. It does not assign anything, yet you use it as if it did when trying to give x[3] the value "9". Assignment is handled by <-, whether inside or outside functions. See ?'==' and ?'<-' for more.

Third, assigning a value to x[3] within the apply() function does not make sense. apply() simply returns an array. It does not modify the OXT object. Below is an example of how your original approach might look. However, Andrie's method is probably better for you.

OXT <- read.table(textConnection(
    "3 3
    3 2
    3 3
    3 3
    3 3
    3 3
    2 3
    1 2
    2 2
    3 3"))
colnames(OXT)<- c("OXT1","OXT2")

OXT$HAP <- apply(OXT, 1, function(x)
    {
        if(x[1] == 3 & x[2] == 3) result <- 9
        else if(x[1] == 3 & x[2] == 2) result <- 8
        else if(x[1] == 3 & x[2] == 1) result <- 7
        else result <- 0
        return(result)
    })
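For the narrower goal stated in the question (9 for the 3/3 rows, 0 for everything else), a vectorized ifelse() is one possible alternative to apply(). This is a sketch only; the data frame is rebuilt inline here so that the snippet runs on its own:

```r
# Same values as the question's data, typed in directly
OXT <- data.frame(OXT1 = c(3, 3, 3, 3, 3, 3, 2, 1, 2, 3),
                  OXT2 = c(3, 2, 3, 3, 3, 3, 3, 2, 2, 3))

# & compares the two columns element by element;
# ifelse() then picks 9 or 0 for each row
OXT$HAP <- ifelse(OXT$OXT1 == 3 & OXT$OXT2 == 3, 9, 0)
OXT
```

This does not scale well to all nine cases either, which is why the lookup-matrix approach remains the better fit for the full problem.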
jthetzel
  • @jhetzel The OP wants to match 9 cases which might render the above series of tests ugly at the end; agree for the rest. – chl May 04 '11 at 17:34
  • @jhetzel - `=` can also be used for assignment. It's not normally a good idea, but it can be done. – richiemorrisroe May 04 '11 at 17:39
  • @chl I agree. To be clear, using a series of conditionals is not the best approach. Your and Andrie's approach is the way to go. I only include the apply function for the first three matches above to help Bill better understand why his original code failed. – jthetzel May 04 '11 at 17:50
  • @richiemorrisroe Because using `=` for assignment is not normally a good idea, I did not mention it. For those interested, a quick overview of `<-` versus `=` on SO: http://stackoverflow.com/questions/1741820/assignment-operators-in-r-and – jthetzel May 04 '11 at 17:56
  • Wow! Thank you all for your incredible support. You are a very welcoming crowd. I'm sure it is blatantly obvious, but I am extremely new to R and programming in general and I appreciate everyone taking the time to correct my rookie code and offer solutions that expand my knowledge. Thank you again! – Bill May 06 '11 at 13:19

Another approach is to paste the two columns together and make a factor.

df <- structure(list(V1 = c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 2L, 3L), 
        V2 = c(3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L)), .Names = c("V1", 
        "V2"), class = "data.frame", row.names = c(NA, -10L))

df$hap <- factor(paste(df$V1, df$V2, sep=""))

Or equivalently,

df$hap2 <- factor(apply(df[1:2], 1, paste, collapse=""))
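If numeric codes are still wanted, one option is to give the factor an explicit set of levels so that the codes do not depend on which combinations happen to appear in the data. A sketch, where `all_levels` and `code` are names introduced for this illustration:

```r
df <- data.frame(V1 = c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 2L, 3L),
                 V2 = c(3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L))

# All nine combinations in a fixed order: "11", "12", "13", "21", ..., "33"
all_levels <- paste0(rep(1:3, each = 3), rep(1:3, times = 3))

df$hap <- factor(paste0(df$V1, df$V2), levels = all_levels)

# The integer code of each level is its position in all_levels (1 to 9)
df$code <- as.integer(df$hap)
df
```

Without the levels argument, as.integer() would number only the combinations that are present, in sorted order, so the codes could change from one data set to another.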
Aaron left Stack Overflow
  • (+1) Yes, good idea, but that will be less easy to transform back to genotype/haplotype data in turn. (I think each column indexes the frequency of minor allele of a DNA sequence + 1, e.g. [SNP](http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism), which might be coded as 1=AA, 2=AB, 3=BB, B being the minor allele.) – chl May 04 '11 at 17:53
  • True; it's perhaps not the best thing in this particular case. If there were more columns that needed to be combined or if the data didn't have such a clear interpretation it might be more appropriate. – Aaron left Stack Overflow May 04 '11 at 18:12