How to get the top 100 cell counts in ggplot2 with geom_bin2d

Question

Before asking, I have read this post, but mine is more specific.

library(ggplot2)
library(scales)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

I have replaced my real data with dat. With this random seed, x and y both fall within [-4, 4], and I partition that area into 256 (16 × 16) cells, each 0.5 wide. For each cell, I want the number of points it contains.

Yeah, it's quite easy, geom_bin2d can solve it.

# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d() 

# Get data - this includes counts and x,y coordinates 
newdat <- ggplot_build(p)$data[[1]]

# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2, 
                  label=count), col="white")

So far so good, but I only want to label the cells with the top 100 counts, as in the picture below.

According to ?geom_bin2d, drop = TRUE only removes cells with a count of 0, whereas I care about the top 100 counts. What should I do? This is question 1.

Please also look at the legend of the second picture: the counts there are small and close together. What if they were 10,000, 20,000, 30,000?

One approach is the trans argument of scale_fill_gradient. The built-in transformations are exp, log, sqrt, and so on, but I want to divide by 1,000. I found trans_new() in the scales package and tried it, but it did not work.

sci_trans <- function(){ trans_new('sci', function(x) x/1000, function(x) x*1000)}
p + scale_fill_gradient(trans='sci')

This is question 2. I have googled a lot but cannot find a solution. Thanks a lot to anyone who can help!

@user20650 I have read your answer of this [post](http://stackoverflow.com/questions/28771018/getting-counts-on-bins-in-a-heat-map-using-r) , could you do me a favor? Thank you — Ling Zhang, Nov 30 '16 at 11:14
Related: [How to use stat_bin2d() to compute counts labels in ggplot2?](http://stackoverflow.com/questions/27476327/how-to-use-stat-bin2d-to-compute-counts-labels-in-ggplot2) where @MrFlick 's comment quotes Hadley from 2010: "he basically says you can't use stat_bin2d, you'll have to do the summarization yourself". Neither stat_bin2d nor stat_summary_2d seem to expose their output bins and counts. — smci, Nov 30 '16 at 20:29

score 0 · Answer 1 · edited May 23 '17 at 12:24


Apparently you can't get the output bins or counts from stat_bin2d or stat_summary_2d. In a related question, How to use stat_bin2d() to compute counts labels in ggplot2?, @MrFlick's comment quotes Hadley from 2010: "he basically says you can't use stat_bin2d, you'll have to do the summarization yourself".

So, the workaround: create the coordinate bins yourself, compute the 2D counts, then take the top n. For example, using dplyr:

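library(dplyr)

# some_fn() is a placeholder for whatever function assigns a value to its bin
dat %>% mutate(x_binned = some_fn(x), y_binned = some_fn(y)) %>%
        group_by(x_binned, y_binned) %>%
        summarize(count = n()) %>%   # NOTE: no need to sort() or order()
        ungroup() %>%                # drop the leftover grouping so cells are ranked together
        top_n(100, count)            # keeps ties, so this can return more than 100 rows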
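
To make the outline above concrete, here is a minimal sketch under stated assumptions. The breaks are my own choice (width 0.5, as in the question, over a whole-number range that covers the data) and are not taken from ggplot2's binning code, so the cells need not line up with what geom_bin2d draws. It also assumes a dplyr version that provides slice_max() (1.0.0 or later), used here instead of top_n() so that ties at the cutoff cannot push the result past 100 rows.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Assumed breaks: width 0.5 over a whole-number range covering both x and y
rng  <- range(c(dat$x, dat$y))
brks <- seq(floor(rng[1]), ceiling(rng[2]), by = 0.5)

# One row per non-empty cell, with the number of points in it
counts <- dat %>%
  mutate(x_binned = cut(x, brks, include.lowest = TRUE),
         y_binned = cut(y, brks, include.lowest = TRUE)) %>%
  count(x_binned, y_binned, name = "count")

# The 100 fullest cells; with_ties = FALSE caps the result at 100 rows
top100 <- counts %>%
  slice_max(count, n = 100, with_ties = FALSE)
```

With only 1000 points, many outer cells hold a single point, so which of those tied cells make the cut is arbitrary. If that matters, keep the ties (with_ties = TRUE) and accept more than 100 rows.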

You might have to poke into stat_bin2d in order to copy (or call) their exact coordinate-binning code. UPDATE: here's the source for stat-bin2d.r

StatBin2d <- ggproto("StatBin2d", Stat,
  default_aes = aes(fill = ..count..),
  required_aes = c("x", "y"),

  compute_group = function(data, scales, binwidth = NULL, bins = 30,
                           breaks = NULL, origin = NULL, drop = TRUE) {

    origin <- dual_param(origin, list(NULL, NULL))
    binwidth <- dual_param(binwidth, list(NULL, NULL))
    breaks <- dual_param(breaks, list(NULL, NULL))
    bins <- dual_param(bins, list(x = 30, y = 30))

    xbreaks <- bin2d_breaks(scales$x, breaks$x, origin$x, binwidth$x, bins$x)
    ybreaks <- bin2d_breaks(scales$y, breaks$y, origin$y, binwidth$y, bins$y)

    xbin <- cut(data$x, xbreaks, include.lowest = TRUE, labels = FALSE)
    ybin <- cut(data$y, ybreaks, include.lowest = TRUE, labels = FALSE)

    ...

  }

bin2d_breaks <- function(scale, breaks = NULL, origin = NULL, binwidth = NULL,
                      bins = 30, right = TRUE) {
  ...

(But this seems a worthy enhancement request for ggplot2, if it hasn't already been filed.)
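
The two cut() calls in the excerpt can be tried outside ggplot2. The sketch below is base R only and makes one loud assumption: the breaks are hand-picked (width 0.5 over a range covering the data) rather than produced by bin2d_breaks(), whose body is elided above. It shows how the integer bin indices from cut(..., labels = FALSE) turn into per-cell counts and cell centres.

```r
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Assumed breaks (not ggplot2's own): width 0.5 over a range covering the data
brks <- seq(floor(min(dat$x, dat$y)), ceiling(max(dat$x, dat$y)), by = 0.5)

# Same call shape as in the excerpt: one integer bin index per point
xbin <- cut(dat$x, brks, include.lowest = TRUE, labels = FALSE)
ybin <- cut(dat$y, brks, include.lowest = TRUE, labels = FALSE)

# Count points per (xbin, ybin) pair and drop the empty combinations
cells <- as.data.frame(table(xbin = xbin, ybin = ybin), stringsAsFactors = FALSE)
cells <- cells[cells$Freq > 0, ]

# Convert bin indices back to cell centres
mids    <- (head(brks, -1) + tail(brks, -1)) / 2
cells$x <- mids[as.integer(cells$xbin)]
cells$y <- mids[as.integer(cells$ybin)]
```

Each row of cells then describes a rectangle centred at (x, y) with width and height 0.5, not a single point, which is the form a tile-style geom expects.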

edited May 23 '17 at 12:24 by Community · answered Nov 30 '16 at 20:31 by smci

Thanks, @smci `x_binned=some_fn(x), y_binned=some_fn(y)` you mentioned is something about coordinate-binning code, right? And `head(...,100)`, you want me to order and get head 100? Get it and I will have a try. But in ggplot2, there is a direct way to get the count – Ling Zhang Dec 01 '16 at 00:25
By `some_fn()`, I meant you just write/copy some fn which bins them, with the same bin sizes as `stat_bin2d`, which will involve looking at the code a little. I couldn't see how at a quick glance. Also yes, I forgot `order(count) %>% ...` And maybe you have to wrap `head(...)` with `do()`. You get the idea though. – smci Dec 01 '16 at 03:38
OK, got it. I have tried another way, based on `newdat <- ggplot_build(p)$data[[1]]`: I order `newdat`, add a `rank` column, and then label the top-100 cells with `rank` instead of `count`. By the way, what about question 2? The counts are large, like 10,000, 20,000, 30,000, and I want to divide them by 1,000. The built-in transformations are `sqrt`, `log` and so on, so I wrote my own `sci_trans`, but it does not work. – Ling Zhang Dec 01 '16 at 07:37
You could do `summarize(count = n()/1000)`, `log(count)`, or whatever you want. Maybe use `log1p()` so it doesn't blow up on 0. – smci Dec 01 '16 at 09:42
Thank you, I understood what you mean – Ling Zhang Dec 02 '16 at 00:56
UPDATE: a) I posted you the source link for stat-bin2d.r , now you get to dig into it, or single-step it. b) I noticed one improvement, to avoid operationally sorting a large number of bins: instead of `order() %>% head()`, just directly take `top_n(..., 100)` – smci Dec 02 '16 at 01:24
Thank you. I now have `x_binned`, `y_binned` and `count`, but how do I plot each cell with its count? `x_binned` and `y_binned` only describe a point. – Ling Zhang Dec 02 '16 at 01:50
I looked at @kohske's answer in this [post](http://stackoverflow.com/questions/6414521/can-ggplot-make-2d-summaries-of-data/8018567#8018567). It was cool, but it no longer works now that `proto` has been replaced by `ggproto`. I tried to rewrite the code, but there were too many errors to fix and I could not finish it. – Ling Zhang Dec 02 '16 at 01:55
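
The approach described in the comments (ordering the built layer data and labelling only the top cells) can be sketched as follows. This is an illustration, not a confirmed ggplot2 recipe: it relies on the built data exposing `count`, `xmin`, `xmax`, `ymin` and `ymax`, as the question's own code does, and `binwidth = 0.5` is taken from the question's description, not its code. For the legend it uses the `labels` argument of `scale_fill_gradient()` to divide the displayed values by 1,000. A linear transformation such as `sci_trans` would not be expected to change the colours, and the legend labels would still be shown in the original units.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d(binwidth = 0.5)

# Built layer data: one row per drawn cell, with its count and bounds
newdat <- ggplot_build(p)$data[[1]]

# Keep the 100 cells with the highest counts (ties at the cutoff follow row order)
top100 <- head(newdat[order(-newdat$count), ], 100)

p_top <- p +
  geom_text(data = top100,
            aes(x = (xmin + xmax) / 2, y = (ymin + ymax) / 2, label = count),
            colour = "white", inherit.aes = FALSE) +
  # Show legend values in thousands without transforming the scale itself
  scale_fill_gradient(labels = function(v) v / 1000)
```

Printing `p_top` draws the heat map with labels on the selected cells only. With counts as small as in this toy data the legend will show fractions; the label function is meant for counts in the tens of thousands.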
