I have the following data frame:

varnames<-c("ID", "a.1", "b.1", "c.1", "a.2", "b.2", "c.2")

a <-matrix (c(1,2,3,4, 5, 6, 7), 2,7)

colnames (a)<-varnames

df<-as.data.frame (a)


   ID  a.1  b.1  c.1  a.2  b.2  c.2
 1  1    3    5    7    2    4    6
 2  2    4    6    1    3    5    7

I would like to categorize the "a.2", "b.2", and "c.2" columns using the quartiles of "a.1", "b.1", and "c.1", respectively:

cat.a.2<-cut(df$a.2, c(-Inf, quantile(df$a.1), Inf))#categorizing a.2 using quartiles of a.1

cat.a.2
[1] (-Inf,3] (-Inf,3]
Levels: (-Inf,3] (3,3.25] (3.25,3.5] (3.5,3.75] (3.75,4] (4, Inf]

cat.b.2<-cut(df$b.2, c(-Inf, quantile(df$b.1), Inf))# categorizing b.2 using quartiles of b.1

cat.b.2
[1] (-Inf,5] (-Inf,5]
Levels: (-Inf,5] (5,5.25] (5.25,5.5] (5.5,5.75] (5.75,6] (6, Inf]


cat.c.2<-cut(df$c.2, c(-Inf, quantile(df$c.1), Inf))# categorizing c.2 using quartiles of c.1

 cat.c.2
[1] (5.5,7] (5.5,7]
Levels: (-Inf,1] (1,2.5] (2.5,4] (4,5.5] (5.5,7] (7, Inf]

Is there any way to do this task automatically?

I naively experimented with sapply ():

quant.vars<-c("a.1","b.1", "c.1") # names of the variables whose quartiles I am going to use
vars<-c("a.2","b.2", "c.2") # names of the variables I am going to categorize
sapply (vars,FUN=function (x){cut (df [,x], quantile (df[,quant.vars], na.rm=T))})
         a.2        b.2          c.2       
[1,] "(1,3.25]" "(3.25,4.5]" "(5.75,7]"
[2,] "(1,3.25]" "(4.5,5.75]" "(5.75,7]"

Of course, it is not the result I wanted.

Moreover, when I add "Inf" to the cut () breaks, I get the following error:

sapply (vars,FUN=function (x){cut (df [,x], c(quantile (df[,quant.vars], Inf), na.rm=T))})

  Error in quantile.default(df[, quant.vars], Inf) : 'probs' outside [0,1]
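
The error message suggests what went wrong: `Inf` is passed as the second argument of `quantile()`, which is `probs`, while `na.rm=T` ends up inside `c()`. A minimal sketch of the intended call for a single pair, assuming `-Inf` and `Inf` are meant as outer breaks around the quartiles (the helper name `cut_by_quartiles` is hypothetical, not part of the original post):

```r
# Rebuild the example data from the question.
varnames <- c("ID", "a.1", "b.1", "c.1", "a.2", "b.2", "c.2")
df <- as.data.frame(matrix(c(1, 2, 3, 4, 5, 6, 7), 2, 7))
colnames(df) <- varnames

# Hypothetical helper: cut `x` at the quartiles of `ref`.
# -Inf and Inf are added with c() *outside* quantile(), and
# na.rm is passed to quantile() rather than to c().
cut_by_quartiles <- function(x, ref) {
  breaks <- c(-Inf, quantile(ref, na.rm = TRUE), Inf)
  cut(x, breaks)
}

cat.a.2 <- cut_by_quartiles(df$a.2, df$a.1)
```

This handles one pair at a time; pairing the columns automatically is a separate step.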

In summary, my question is how to make R:

  1. Calculate the quantiles of the variables with suffix 1 (a.1, b.1, c.1)

  2. Recognize pairs of variables that share a common prefix (a.1 and a.2, b.1 and b.2, c.1 and c.2)

  3. In each pair, categorize the variable with suffix 2 using the quantiles obtained from the variable with suffix 1 (a.2 categorized by a.1 quantiles, b.2 by b.1 quantiles, c.2 by c.1 quantiles)

Thank you very much

DSSS

1 Answer

Something like this?

#find prefixes that occur more than once (i.e. have both a .1 and a .2 column)
temp <- do.call(rbind,strsplit(names(df)[-1],".",fixed=TRUE))
dup.temp <- temp[duplicated(temp[,1]),,drop=FALSE]

#for each prefix, cut the .2 column at the quartiles of the .1 column
res <- lapply(dup.temp[,1],function(i) {
  breaks <- c(-Inf,quantile(df[,paste(i,1,sep=".")]),Inf)
  cut(df[,paste(i,2,sep=".")],breaks)
})

#make list a data.frame
res <- do.call(cbind.data.frame,res)
names(res) <- paste("cut",dup.temp[,1],2,sep=".")

#    cut.a.2  cut.b.2 cut.c.2
# 1 (-Inf,3] (-Inf,5] (5.5,7]
# 2 (-Inf,3] (-Inf,5] (5.5,7]

res[,1]
# [1] (-Inf,3] (-Inf,3]
# Levels: (-Inf,3] (3,3.25] (3.25,3.5] (3.5,3.75] (3.75,4] (4, Inf]

If speed is an issue, there is room for optimization.
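
As one possible variation (a sketch, not necessarily faster), the pairing can also be done by matching on the column-name suffixes and walking over the `.2` and `.1` columns in parallel with `Map()`. The names `ref.vars`, `prefixes`, and `res2` are made up for this illustration, and the example data is rebuilt so the block runs on its own:

```r
# Rebuild the example data from the question.
varnames <- c("ID", "a.1", "b.1", "c.1", "a.2", "b.2", "c.2")
df <- as.data.frame(matrix(c(1, 2, 3, 4, 5, 6, 7), 2, 7))
colnames(df) <- varnames

# Prefixes that have both a ".1" and a ".2" column.
ref.vars <- grep("\\.1$", names(df), value = TRUE)
prefixes <- sub("\\.1$", "", ref.vars)
prefixes <- prefixes[paste0(prefixes, ".2") %in% names(df)]

# x is a ".2" column, ref is the matching ".1" column.
res2 <- Map(function(x, ref) cut(x, c(-Inf, quantile(ref, na.rm = TRUE), Inf)),
            df[paste0(prefixes, ".2")],
            df[paste0(prefixes, ".1")])
res2 <- as.data.frame(res2)
names(res2) <- paste0("cut.", prefixes, ".2")
```

On the example data this should give the same three factor columns as `res` above.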

Roland
  • The code works excellently; I am still trying to understand the loop. I wish I could write something as smart as this code one day. Thanks again! – DSSS Apr 23 '13 at 16:08
  • The code works perfectly with the example I provided, but with my real dataframe the loop gives an error: "Error in cut.default(newdata.1[, paste(i, 1, sep = ".")], breaks) : 'breaks' are not unique". I have no idea why it happens, as the structures of dataframes are similar, I just made the example smaller and with brief variable names for simplicity. Could you please suggest possible reason(s) for this error? Thank you very much. – DSSS Apr 24 '13 at 05:23
  • Possibly, there is a `.1` column with all equal values. E.g., look at `quantile(rep(1,5))`. You could test for that inside the anonymous function and handle it somehow. – Roland Apr 24 '13 at 06:45
  • Thank you very much, Roland. I have also posted a question here http://stackoverflow.com/questions/16184947/cut-error-breaks-are-not-unique and got an explanation that this happens because one variable has several quantiles with the same value. The problem is fixed by wrapping the breaks in `unique()` before passing them to `cut()`. – DSSS Apr 24 '13 at 15:05
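
The situation discussed in the comments can be reproduced with a constant vector, as in the `quantile(rep(1,5))` hint. The following is a minimal sketch, assuming it is acceptable to end up with fewer categories when quartiles coincide; the variable names `err` and `ok` are only for illustration:

```r
x <- rep(1, 5)
q <- quantile(x)             # all five quartile values are equal to 1
breaks <- c(-Inf, q, Inf)    # therefore contains repeated values

# cut() refuses repeated breaks; capture the message instead of stopping.
err <- tryCatch(cut(x, breaks), error = function(e) conditionMessage(e))

# Dropping the duplicates leaves -Inf, 1, Inf, i.e. two intervals.
ok <- cut(x, unique(breaks))
```

Note that `unique()` merges the tied quartiles, so the resulting factor has fewer levels than in the non-degenerate case.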