
I have a data frame (df) with a categorical variable (CHR, 22 levels) and a continuous variable (POS, the chromosomal position, whose range differs among CHR levels). I want to create an additional categorical variable that bins POS into equally sized ranges, computed separately from the POS values within each CHR level. For example, suppose this is the df:

CHR POS
1   2
1   4
1   6
.   .
.   .
1   30
.   .
.   .
.   .
22  150
22  162
22  170
22  185

So I first tried splitting the df by CHR:

> df_split <- split(df, f=df$CHR)

# then I wrote a function that uses the "cut" function

> bins <- function(df){
  lower <- min(df$POS)
  upper <- max(df$POS)
  cut(df$POS, seq(lower,upper, 10))
}

# finally I used lapply to apply my custom "cut" function to each piece

> bin_1 <- lapply(df_split, bins)

The problem is that the cut function is not working as I expect.
Thanks for any help!

  • What exactly do you mean when you say "the cut function is not working"? Are you getting an error? Include a proper [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and the desired output for that sample input. – MrFlick Mar 23 '17 at 02:15
  • I am supposed to get a list of vectors with an interval for each CHR level, but I'm currently getting a list of similar intervals. The desired output should be a list of ranges or intervals for each level of the CHR variable, equally sized but taking into account the maximum and minimum values of the POS variable (nested by CHR level):

    CHR POS New_VAR
    1   1   1-3
    1   2   1-3
    1   3   1-3
    1   4   4-6
    1   5   4-6
    1   6   4-6
    1   7   7-9
    1   8   7-9
    2   11  11-13
    2   12  11-13
    2   13  11-13
    2   14  14-16
    2   15  14-16
    2   16  14-16
    2   17  17-19
    2   18  17-19
    2   19  17-19

    – Angel Criollo Rayo Mar 23 '17 at 02:53
  • Your code works generally fine for me. I get pretty much exactly what you are expecting when I run what you have above. – thelatemail Mar 23 '17 at 03:10
  • Yes, it partly works. The thing is that the code above is not generating the intervals I need, and I don't know why the sequence I gave to the "cut" function is not being used. – Angel Criollo Rayo Mar 23 '17 at 03:23
  • @AngelCriolloRayo - can you be a bit more specific about what doesn't work and what you expect? Do you just need to specify `cut(..., include.lowest=TRUE)` or something? – thelatemail Mar 23 '17 at 03:58
  • I expect the POS variable to be cut into different intervals (in bins of 1E6), nested according to the CHR variable. What I'm getting is, for example, the following, which is not even the range or intervals I want: (6.11e+04,1.06e+06] (6.11e+04,1.06e+06] (6.11e+04,1.06e+06] (6.11e+04,1.06e+06] (6.11e+04,1.06e+06] – Angel Criollo Rayo Mar 23 '17 at 04:19
  • @thelatemail, thanks for your comment. I realised the intervals are actually fine; the df is so large that the first lines all belong to the same interval. – Angel Criollo Rayo Mar 23 '17 at 05:26
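
For reference, a minimal sketch of the per-CHR binning discussed in the comments above. It is only an illustration under stated assumptions: the sample data and the helper name `bin_pos` are hypothetical, and the bin width of 10 is taken from the question's `seq(lower, upper, 10)`. It addresses two things that may bite with the original `bins` function: with the defaults of `cut`, a value equal to the lowest break falls outside the first interval and becomes `NA`, and `seq(lower, upper, 10)` can stop short of `upper`, so the largest positions can become `NA` as well. Raising `dig.lab` also keeps labels such as `(6.11e+04,1.06e+06]` from being abbreviated in scientific notation.

```r
# Hypothetical sample data; the real df has 22 CHR levels and much larger POS values.
df <- data.frame(
  CHR = c(1, 1, 1, 1, 22, 22, 22, 22),
  POS = c(2, 4, 6, 30, 150, 162, 170, 185)
)

# bin_pos() is an illustrative helper, not a base R function.
bin_pos <- function(pos, width = 10) {
  lower <- min(pos)
  # Use enough bins that the last break is at or above max(pos).
  n_bins <- max(1, ceiling((max(pos) - lower) / width))
  breaks <- lower + width * (0:n_bins)
  # include.lowest = TRUE keeps the minimum inside the first bin;
  # dig.lab = 10 prints the breaks in full instead of 3 significant digits.
  cut(pos, breaks = breaks, include.lowest = TRUE, dig.lab = 10)
}

# Bin POS separately within each CHR level, then put the labels back in row order.
bins_by_chr <- lapply(split(df$POS, df$CHR),
                      function(p) as.character(bin_pos(p)))
df$BIN <- unsplit(bins_by_chr, df$CHR)

print(df)
```

On a large data frame, `table(df$CHR, df$BIN)` is a quicker check than reading the first rows, since consecutive rows often fall in the same interval.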

0 Answers