Create groups based on percent_rank in dplyr

Question

I am trying to create some groups based on the percent rank of some values in dplyr.

The code below creates a data frame and then sapply a function to determine the groups. The downside is that I can't get sapply to work for tbl_postgres, only data frames. So I'm curious if there is another solution for this.

I had considered something with ntile, but the groups I want to create have some arbitrary cut-offs. Also, I have not had much luck getting it to work with dplyr (maybe pure sql might work).

library(dplyr)

n <- 100

df1 <- data.frame(idx = 1:n, x = rnorm(n))

df1 <- df1 %>%
            arrange(x) %>%
            mutate(pc_x = percent_rank(x))

index <- function(x) {
    if (x < 0) {
        return(NA)
    } else if (x < 0.3) {
        return(1)
    } else if (x < 0.7) {
        return(2)
    } else if (x <= 1) {
        return(3)
    } else {
        return(NA)
    }
}

df1 <- df1 %>%
            mutate(group = sapply(pc_x, index))

After seeing the answer, this is related to the question here: http://stackoverflow.com/questions/23163567/r-dplyr-categorize-numeric-variable-with-mutate — John, Nov 06 '15 at 18:37

score 4 · Accepted Answer · answered Nov 06 '15 at 18:33

4

Perhaps cut will serve your needs:

library(dplyr)
n <- 100
set.seed(42)
df1 <- data.frame(idx = 1:n, x = rnorm(n))
df1 <- df1 %>%
    arrange(x) %>%
    mutate(pc_x = percent_rank(x))

I use -1e9 in breaks because cut is "left-open", so if I used breaks <- c(0, ...) then the first row would be NA instead of 1.

breaks <- c(-1e9, 0.3, 0.7, 1)
df1 %>%
    mutate(grp = cut(pc_x, breaks=breaks, labels=FALSE)) %>%
    group_by(grp)
## Source: local data frame [100 x 4]
## Groups: grp [3]
##      idx          x       pc_x   grp
##    (int)      (dbl)      (dbl) (int)
## 1     59 -2.9930901 0.00000000     1
## 2     18 -2.6564554 0.01010101     1
## 3     19 -2.4404669 0.02020202     1
## 4     39 -2.4142076 0.03030303     1
## 5     22 -1.7813084 0.04040404     1
## ..   ...        ...        ...   ...

answered Nov 06 '15 at 18:33

r2evans

141,215
6
77
149

Thanks. This is precisely what I wanted. And it turns out that my question is a duplicate. – John Nov 06 '15 at 18:36
I was hitting myself for why I couldn't find the cut function in dplyr until I realized that is was from base R. One downside to your approach is that since it is not in dplyr it doesn't work for anything other than local data frames. – John Nov 06 '15 at 18:48
True ... and since I am about to (but have not yet) play with dplyr in a database environment, I didn't know that. Perhaps you could suggest an issue on [dplyr github](https://github.com/hadley/dplyr/issues) to extend `ntile` to accept arbitrary breaks? It seems like a plausible use-case, though I'm not certain about the underlying SQL translation. – r2evans Nov 06 '15 at 18:55
In a comment to this question, the author of dplyr has already said that cut is on the list. http://stackoverflow.com/questions/21714457/is-cut-style-binning-available-in-dplyr – John Nov 06 '15 at 19:02
Unfortunately that was 21 months ago, and the [current milestones](https://github.com/hadley/dplyr/milestones) don't include anything reference cuts, breaks, or similar enhancements to `ntile`. – r2evans Nov 06 '15 at 19:07
There is an implementation for databases in dplrr package: https://github.com/hoxo-m/dplyrr/tree/master/R – fc9.30 Feb 23 '16 at 08:09
I see you [started the process](https://github.com/hadley/dplyr/pull/1225) of submitting PRs to [dplyr](https://github.com/hadley/dplyr). Is there something that would facilitate/accelerate getting subsequent PRs made? I know it's not an automatic process, just asking if there have been roadblocks to moving forward. (I'm encouraged by the the fact that Hadley [started with](https://github.com/hoxo-m/dplyrr/issues/1) *"I don't see anything here that shouldn't be in dplyr"*, a good sign that it should and most likely will be accepted.) – r2evans Feb 23 '16 at 11:21

Steven Beaupré · Answer 2 · 2015-11-09T23:23:21.553

3

As per suggested by @joranE and @krlmlr in response to the issue you posted on GitHub, you could build your own custom sql query using sql():

library(dplyr)
library(microbenchmark)

n <- 100
set.seed(42)
df <- data.frame(idx = 1:10e5, x = rnorm(n))
copy_to(my_db, df, "df")

mbm <- microbenchmark(
joranE = tbl(my_db, sql("
  SELECT x,
    CASE
      WHEN x > 0   AND x <= 0.3 THEN 1
      WHEN x > 0.3 AND x <= 0.6 THEN 2
      WHEN x > 0.6 AND x <= 1   THEN 3
      ELSE NULL
    END
    FROM df")),
krlmlr = tbl(my_db, sql("
  SELECT x,
    CASE
      WHEN x <= 0.3 THEN
        CASE WHEN x <= 0 THEN NULL
        ELSE 1
        END
      ELSE
        CASE WHEN x <= 0.6 THEN 2
        WHEN x <= 1 THEN 3
        ELSE NULL
      END
    END
    FROM df")),
times = 100
)

Both methods yield similar results:

#Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval cld
# joranE 3.070625 3.118589 3.548202 3.206681 3.307202 30.688142   100   a
# krlmlr 3.058583 3.109567 3.250952 3.205483 3.278453  3.933817   100   a

edited Nov 09 '15 at 23:23

answered Nov 06 '15 at 23:21

Steven Beaupré

21,343
7
57
77

I finally had a chance to look a bit closer at this. I had seen it on the github page (which I only filed because of the suggestion above), but thanks for posting it here. One very helpful adjustment is adding an AS statement near the last END. Otherwise it is hard to refer to the new variable. – John Nov 09 '15 at 23:09
1

@John - Your edit to this answer was approved by the original poster, but please avoid making significant code changes in the future. Suggested code changes should be made in the comments, not through edits because it is changing the content of the post. – skrrgwasme Nov 09 '15 at 23:10
@skrrgwasme Clicked "accept" by inadvertance, rolled back. – Steven Beaupré Nov 09 '15 at 23:15
@StevenBeaupré May I ask why you rolled it back? – John Nov 09 '15 at 23:16
@John Because you made the code not reproducible for others. – Steven Beaupré Nov 09 '15 at 23:19
@StevenBeaupré Fine, I'll just put in comments that the code above has bugs. You refer to n and my_db without defining them anywhere. And it probably makes more sense to change the rnorm to runif (though I made a slightly different change). – John Nov 09 '15 at 23:22
1

@John Don't get me wrong. Your efforts are appreciated, but I think they're misplaced here. Edits should be used to improve formatting, spelling, grammar, word choice, etc - things that improve readability and understanding, but that don't change the content. Your change altered the code's functionality; that's a significant change that should be a comment, not an edit. – skrrgwasme Nov 09 '15 at 23:24
When I search I can't find documentation about using raw sql with dplyr tbls. In my actual problem, I have an existing postgresql database that I make a number of manipulations to using dplyr. One of those manipulations is this adjustment. So I already have a tbl_postgres and I need to run this sql command in addition to the others. I'm sort of stuck at this point. – John Nov 09 '15 at 23:39

Create groups based on percent_rank in dplyr

2 Answers2