I am new to R, and I've been facing this problem for quite some time. Whenever I try to make deciles or quartiles using the 'dplyr' package, my deciles get merged into fewer groups. I want 10 different groups, but I only get 6, 4, or sometimes only 3. I know R tries to group/merge small deciles if there is less data, but I want to avoid this. Please help! Thanks!!

The code is:

 mydata <- data.frame(col1= c(0,00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,3,12,5,65,23,65984,21,5469,321,6,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,5233,18000))

DecLocations <- quantile(mydata$col1, probs = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9))
mydata$decile <- findInterval(mydata$col1,c(-Inf,DecLocations, Inf))

require(dplyr)
mydata$decile<-factor(mydata$decile)
decile_grp<-group_by(mydata,decile)
decile_summ_test<-summarize(decile_grp, total_cnt=sum(col1))
decile_summ_test<-arrange(decile_summ_test, desc(decile))
View(decile_summ_test)

Here I'm only getting 6 deciles because R merges the small deciles. This is what I'm trying to avoid. I expect to get all 10 deciles, even if they have really small numbers.

PerryThePlatipus
  • Minimal reproducible example? – CPak Oct 12 '17 at 11:38
  • You should provide a simple data example that illustrates your problem and what you expect as output. Please do not add code in the comments; rather, edit your original post with your code/example/updates. – CPak Oct 12 '17 at 11:51
  • @CPak done that now! – PerryThePlatipus Oct 12 '17 at 12:24
  • @PerryThePlatipus No you haven't. Reproducible means we should be able to take your code and run it. We don't have your data so we can't run it. But it looks like you're predicting something that only has three outcomes so... what exactly do you expect it to look like? – Dason Oct 12 '17 at 12:31
  • @Dason I'm sorry, not able to make a reproducible example at the moment. I'm expecting it to look something like this: `**decile** total_cnt ======== **10** 10480 **9** 6158 **8** 2346 **7** 846 **6** 564 **5** 345 **4** 154 **3** 125 **2** 84 **1** 56` – PerryThePlatipus Oct 12 '17 at 12:45
  • And are 10, 9, ..., 1 values that you see in your data? – Dason Oct 12 '17 at 13:14
  • And you absolutely should be able to create a reproducible example. We don't need your full data. Give us some made up data that gives the same issue you're seeing. Read this for more info: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Dason Oct 12 '17 at 13:15
  • Thanks for showing @Dason . I have tried to do that now, could you please check and help me with it now :) Sorry didn't have any idea how to do that. – PerryThePlatipus Oct 12 '17 at 14:19
  • @CPak I have added the reproducible code, if you want to check now. Sorry I didn't know how to do that before. – PerryThePlatipus Oct 12 '17 at 14:20
  • When you say you expect to get all the results even if they have really small numbers - are you including 0 observations as a small number? Because that's the reason they're being dropped - there aren't any observations in those groups. – Dason Oct 12 '17 at 14:43

1 Answer

If you look at your DecLocations vector, you can see that R does compute all nine decile breakpoints, but the first four of them are all 0. When you pass these tied breakpoints to findInterval, the lower intervals stay empty because of how findInterval is defined (see ?findInterval).

Part of the help file:

Description

Given a vector of non-decreasing breakpoints in vec, find the interval containing each element of x; i.e., if i <- findInterval(x,v), for each index j in x v[i[j]] ≤ x[j] < v[i[j] + 1] where v[0] := - Inf, v[N+1] := + Inf, and N <- length(v). At the two boundaries, the returned index may differ by 1, depending on the optional arguments rightmost.closed and all.inside.

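Here you can see that, for each x[j], the function returns the largest index i[j] such that v[i[j]] ≤ x[j] < v[i[j] + 1]. With your breakpoints c(-Inf, 0, 0, 0, 0, 5, ...), every 0 gets index 5, so intervals 1 to 4 never receive any observations. That's the reason those deciles are missing: they are empty, not merged.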
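To see the tie behaviour in isolation, here is a minimal sketch with a shortened, made-up breakpoint vector (`breaks` and `idx` are illustrative names, not taken from the question):

```r
# Breakpoints with ties, mirroring the repeated 0s in DecLocations
breaks <- c(-Inf, 0, 0, 0, 0, 5, Inf)

# For each x, findInterval() returns the largest i with breaks[i] <= x,
# so a value of 0 skips past the tied breakpoints and lands in interval 5
idx <- findInterval(c(0, 1, 5), breaks)
idx
# idx is 5 5 6
```

Intervals 1 to 4 have zero width here, so no value can ever fall into them.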

If you want to have all deciles represented in your vector, you would have to distribute the zeros across the lower deciles in some (random?) way.
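One possible way to do that, as a sketch rather than the only option, is to bucket by rank instead of by value. The names `r` and `decile` below are illustrative, and `col1` is rebuilt from the question's data (22 zeros plus the non-zero values). Ties are broken by position with `ties.method = "first"`; `ties.method = "random"` would spread them randomly instead.

```r
# Sample data from the question: 22 zeros followed by the non-zero values
col1 <- c(rep(0, 22), 1, 5, 3, 12, 5, 65, 23, 65984, 21, 5469, 321, 6,
          seq(100, 1500, by = 100), 5233, 18000)

# Break ties by position so that every observation gets a distinct rank
r <- rank(col1, ties.method = "first")

# Map ranks 1..n onto deciles 1..10 of nearly equal size
decile <- ceiling(r * 10 / length(col1))

# All ten deciles are now populated
table(decile)
```

Keep in mind that identical values (here the zeros) then end up in different deciles, so the group boundaries are no longer defined by the values alone. If I remember correctly, dplyr's ntile() buckets by rank in a similar spirit, though its group sizes may differ slightly from this formula.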

DecLocations <- quantile(mydata$col1, probs = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9))
DecLocations
 10%  20%  30%  40%  50%  60%  70%  80%  90% 
   0    0    0    0    5   65  400  900 1400 

mydata$decile <- findInterval(mydata$col1,c(-Inf,DecLocations, Inf))
head(mydata)
  col1 decile
1    0      5
2    0      5
3    0      5
4    0      5
5    0      5
6    0      5
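If the goal is only to have all ten deciles show up in the summary (with a total of 0 for the empty ones), a sketch along these lines might be enough. It keeps the findInterval assignment shown above and declares the ten levels explicitly; `decile_f` and `total_cnt` are illustrative names, and xtabs is used here instead of the dplyr pipeline from the question.

```r
# Sample data from the question: 22 zeros followed by the non-zero values
col1 <- c(rep(0, 22), 1, 5, 3, 12, 5, 65, 23, 65984, 21, 5469, 321, 6,
          seq(100, 1500, by = 100), 5233, 18000)

DecLocations <- quantile(col1, probs = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9))
decile <- findInterval(col1, c(-Inf, DecLocations, Inf))

# Declare all ten levels up front so that empty deciles are kept
decile_f <- factor(decile, levels = 1:10)

# xtabs() sums col1 within each level and reports 0 for empty levels
total_cnt <- xtabs(col1 ~ decile_f)
total_cnt
```

With dplyr, I believe group_by(decile, .drop = FALSE) on a factor with all ten levels gives the same effect in recent versions, but check the documentation for your installed version.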
kath