
I am very new to R but I am interested in learning more and improving.

I have a dataset with 40,000+ rows containing the lengths of neuron segments, and I want to compare the length trends of neurons in different groups. The first step in this analysis is sorting each measurement into one of six categories: '<10', '10-15', '15-20', '20-25', '25-30', and '>30'. I created these categories as appended columns using 'mutate' from the 'dplyr' package, and now I am trying to write a boolean function that puts a '1' in the column a measurement fits and a '0' in the columns it doesn't. Here is what I wrote:

    for (i in 1:40019)  {
      {if (FinalData$Length[i] <10) 
        {FinalData$`<10`[i]<-1
      } else {FinalData$`<10`[i]<-0}} #Fills '<10'
      if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
        FinalData$`10-15`[i]<-1
      } else{FinalData$`10-15`[i]<-0} #Fills'10-15'
      if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
        FinalData$`15-20`[i]<-1
      } else{FinalData$`15-20`[i]<-0} #Fills '15-20'
      if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
        FinalData$`20-25`[i]<-1
      } else{FinalData$`20-25`[i]<-0} #Fills '20-25'
      if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
        FinalData$`25-30`[i]<-1 
      } else{FinalData$`25-30`[i]<-0} #Fills '25-30'  
      if(FinalData$Length[i] >=30){
        FinalData$`>30`[i]<-1 
      } else{FinalData$`>30`[i]<-0} #Fills '>30'  
   }

This seems to work, but it takes a long time:

    system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
      user  system elapsed 
     94.408  19.147 118.203 

The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this, or am I doing this appropriately for what I am asking for? Here is an example of some of the values I am testing: 'Length': 14.362, 12.482337, 8.236, 16.752, 12.045. If I am not being clear about how the data frame is structured, here is a screenshot: [image: How my data frame is organized]

Justin
Nick
  • Instead of creating an image of your data, you should include the data. – steveb Mar 08 '16 at 03:43
  • Look at `?cut` - e.g. `cut(c(14.36,12.48,8.26), c(0,10,15))` – thelatemail Mar 08 '16 at 04:06
  • You didn't need to "pre-create" these columns. A single `cut` like @thelatemail suggests would create a single factor column with the right labels. If you then really want it expanded out into dummy variables `model.matrix` would do that for you in one line. You could also probably work something out with `reshape2::dcast` instead of `model.matrix`. If you [share some data reproducibly](http://stackoverflow.com/q/5963269/903061), (either share the code for simulation (preferred) or use `dput()`) I'm sure someone can be more explicit. – Gregor Thomas Mar 08 '16 at 06:28
  • Does it have to be 6 columns like that? Surely a single column with the headings of those 6 as character or factor would be equally if not more useful. – JeremyS Mar 08 '16 at 06:40
  • @JeremyS I want the 6 columns because I am comparing the proportion of neuron segments in each category. I will split the data set into different treatment groups and compare the proportion of segments between treatment groups. This is how I did it when I was using excel, but I know that was probably not the most efficient way. In the end I am making histograms using the proportions for each group and comparing them to show a shortening or lengthening trend. – Nick Mar 09 '16 at 13:33
  • @Nick Then a single column of factors like I have in my answer will help you later on. You can specify the order that factors are plotted using `levels`. – JeremyS Mar 10 '16 at 06:39

2 Answers

You can use the `cut` function in R, which converts numeric values to a factor:

    x <- c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
    cut(x = x, breaks = c(0,3,6,9,12), labels = c("grp1","grp2","grp3","grp4"), right = FALSE)

Set `right = TRUE` (the default) to close each interval on the right, or `right = FALSE` to close it on the left, as your bins require.
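Applied to the bins in the question, a minimal sketch might look like the following. It assumes the sample lengths quoted in the question, uses `-Inf` and `Inf` as the outer breaks so that no value falls outside the bins, and uses `right = FALSE` so that each interval includes its lower bound (a value of exactly 10 lands in '10-15'). The name `lengths_example` is illustrative only.

```r
# Sample lengths quoted in the question (assumed to be representative)
lengths_example <- c(14.362, 12.482337, 8.236, 16.752, 12.045)

# One factor column; right = FALSE makes the intervals [10, 15), [15, 20), ...
bins <- cut(lengths_example,
            breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
            labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
            right = FALSE)

# Counts per bin, including bins that happen to be empty
table(bins)
```

Because `cut` works on the whole vector at once, no explicit loop over rows is needed.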

tushaR
  • Thank you @Tushar for this tip. Using 'cut' is a big help. I also used this link to clear up the details. http://www.r-bloggers.com/r-function-of-the-day-cut/ – Nick Mar 09 '16 at 13:13

You can vectorise that as follows (I made a sample of some data called DF)

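    # Simulated data: an id, a random letter, and integer lengths from 1 to 40
    DF <- data.frame(1:40000, sample(letters, 40000, replace = TRUE), "Length" = sample(1:40, 40000, replace = TRUE))
    MyFunc <- function(x) {
      # Write labels into a separate vector so every comparison uses the numeric x
      out <- character(length(x))
      out[x < 10] <- "<10"
      out[x >= 10 & x < 15] <- "10-15"
      out[x >= 15 & x < 20] <- "15-20"
      out[x >= 20 & x < 25] <- "20-25"
      out[x >= 25 & x < 30] <- "25-30"
      out[x >= 30] <- ">30"
      return(out)
    }
    DF$Group <- MyFunc(DF[, 3])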

If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.

Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.

e.g.

    DF$'<10' <- ifelse(DF$Length < 10, 1, 0)
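If the six 0/1 columns are really required, one possible sketch builds them all from a single `cut` result instead of writing six separate `ifelse` calls. The small `DF_small` data frame and the `bin_labels` and `bins` helpers below are made up for illustration; the column names match the ones in the question.

```r
# Hypothetical small data frame standing in for the real data
DF_small <- data.frame(Length = c(14.362, 12.482337, 8.236, 16.752, 31))

bin_labels <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")
bins <- cut(DF_small$Length,
            breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
            labels = bin_labels,
            right = FALSE)

# One 0/1 column per label: 1 where the row falls in that bin, 0 otherwise.
# The loop runs six times (once per column), not once per row.
for (lab in bin_labels) {
  DF_small[[lab]] <- as.integer(bins == lab)
}

DF_small
```

Each row ends up with exactly one `1` across the six columns, because the bins do not overlap and together cover every numeric value.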
JeremyS
  • Manual yes, but highly adaptable to a lot of different situations, not just numeric cases. – JeremyS Mar 10 '16 at 06:40
  • Thank you, once I tried this it gave me exactly what I needed. I used your function and then ran `summary(as.factor(y))` to get a frequency table of the categories that I could then graph. – Nick Mar 13 '16 at 20:36
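
The frequency table mentioned in the last comment, and the per-group proportions the question is ultimately after, can be sketched with `table` and `prop.table`. The `Treatment` column and its values are hypothetical, since the original data were not posted; the lengths are the sample values from the question.

```r
# Hypothetical data: sample lengths plus a made-up treatment label
DF_small <- data.frame(
  Treatment = c("control", "control", "control", "treated", "treated"),
  Length    = c(14.362, 12.482337, 8.236, 16.752, 12.045)
)

# Single factor column holding the length bin of each segment
DF_small$Group <- cut(DF_small$Length,
                      breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
                      labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
                      right = FALSE)

# Counts of segments per treatment (rows) and length bin (columns)
counts <- table(DF_small$Treatment, DF_small$Group)

# Row-wise proportions: each treatment's row sums to 1
props <- prop.table(counts, margin = 1)
props
```

A grouped bar chart of these proportions could then be drawn with something like `barplot(t(props), beside = TRUE)`, which keeps the bins in the order given by the factor levels.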