I am trying to create a histogram using geom_histogram() that uses a numeric variable for both the x and y axis.

The numeric x axis will be bucketed, and the numeric y axis will show the sum of some other numeric value for each bucket. Right now I am not having any luck and was hoping someone could help.

attach(Pre_vitality_HZ_Data)

buckets_pre = seq(min(Pre_V_HzR),max(Pre_V_HzR)+1,0.05)

ggplot() + 
    geom_histogram(alpha = 0.2, aes(x=Pre_V_HzR, y = sum(Policy_Count)), bins = length(buckets), fill = 'aquamarine3')


Tom
Zak Ray Sick
  • 1
    Can you give an example that doesn't require your `Pre_vitality_HZ_Data`? – tim_yates Mar 06 '17 at 15:56
  • I don't know how to provide an example, but hopefully what I say below can add some context. Let the x variable (pre_v_hzr) be a continuous variable from 0-5. It will be bucketed every 0.5, meaning 0-0.5, 0.5-1, and so on. Let the y variable (Policy_count) just be a weight variable that can take on any continuous number. That is, one row in the data set does not necessarily have weight 1. Let's assume we have two records, both of which fall into the x bin 0-0.5, and the sum of the y values for these records is 2. I want the height of the bar to be 2 on the chart. – Zak Ray Sick Mar 06 '17 at 16:01
  • 1
    A histogram is a univariate plot. If you have x and y values, perhaps you want a barplot instead. See [how to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for tips on providing sample input. Be sure to describe the desired output as clearly as possible. – MrFlick Mar 06 '17 at 16:05
  • All I want to do is sum another variable in my data frame instead of showing the count for each bin. This really can't be done with ggplot? – Zak Ray Sick Mar 06 '17 at 16:24
  • @ZakRaySick, Almost certainly it can be done. Speaking for myself however, I am just not understanding what you need to accomplish. Both x and y are continuous variables? And you want to bin by x ranges and then sum all y-values within each bin? – bdemarest Mar 07 '17 at 03:45
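
To make the request in the comments concrete, here is a rough base R sketch with made-up numbers. The data frame `toy` and its values are hypothetical and not taken from `Pre_vitality_HZ_Data`. Two records fall into the first bin with weights 0.5 and 1.5, so that bin's bar should have height 2.

```r
# Hypothetical toy data: x is continuous on 0-5, y is a weight per row.
toy <- data.frame(
  pre_v_hzr    = c(0.1, 0.4, 0.7, 2.3, 4.9),
  policy_count = c(0.5, 1.5, 1.0, 3.0, 0.25)
)

# Bucket x every 0.5; include.lowest = TRUE keeps x == 0 in the first bin.
bin_edges <- seq(0, 5, by = 0.5)
toy$bin <- cut(toy$pre_v_hzr, breaks = bin_edges, include.lowest = TRUE)

# Sum the weights within each bin (bins with no rows come back as NA).
bin_sums <- tapply(toy$policy_count, toy$bin, sum)
bin_sums
```

The first element of `bin_sums` is the desired bar height for the 0-0.5 bucket.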

1 Answer

To make the plot you want with ggplot2, a straightforward approach is to prepare the data before plotting. In the solution below, I propose dividing the continuous x-variable into a discrete variable with cut(), and using aggregate() to sum the y-values for each bin of x-values. Besides the base R function aggregate(), there are many other ways to summarize, aggregate and reshape your data. You may wish to look into the dplyr or data.table packages (two very powerful, well-supported packages).
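
As an aside, the dplyr route mentioned above could look roughly like the sketch below. It assumes dplyr is installed, and the name `dat_dplyr` is just illustrative. The bin edges are the same ones used in the base R version that follows.

```r
library(dplyr)

# Same 20 equal-width bins as in the base R solution.
disp_bin_edges <- seq(from = 71, to = 472, length.out = 21)

# Bin `disp`, then sum `mpg` within each bin.
dat_dplyr <- mtcars %>%
  mutate(disp_discrete = cut(disp, breaks = disp_bin_edges)) %>%
  group_by(disp_discrete) %>%
  summarise(mpg = sum(mpg))
```

The result has one row per non-empty bin and can be passed to `geom_bar(stat="identity")` in the same way as the `aggregate()` output.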

library(ggplot2)

# Use the built-in data set `mtcars` to make the example reproducible.
# Run ?mtcars to see a description of the data set.

head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Let's use `disp` (engine displacement) as the x-variable
# and `mpg` (miles per gallon) as the y-variable.

# Bin the `disp` column into a discrete variable with `cut()`
disp_bin_edges = seq(from=71, to=472, length.out=21)
mtcars$disp_discrete = cut(mtcars$disp, breaks=disp_bin_edges)

# Use `aggregate()` to sum `mpg` over levels of `disp_discrete`,
# creating a new data.frame.
dat = aggregate(mpg ~ disp_discrete, data=mtcars, FUN=sum)

# Use `geom_bar(stat="identity")` to plot pre-computed y-values.
p1 = ggplot(dat, aes(x=disp_discrete, y=mpg)) +
     geom_bar(stat="identity") +
     scale_x_discrete(drop=FALSE) +
     theme(axis.text.x=element_text(angle=90)) +
     ylab("Sum of miles per gallon") +
     xlab("Displacement, binned")

# For this example data, a scatterplot conveys a clearer story.
p2 = ggplot(mtcars, aes(x=disp, y=mpg)) +
     geom_point(size=5, alpha=0.4) +
     ylab("Miles per gallon") +
     xlab("Displacement")

library(gridExtra)
ggsave("plots.png", arrangeGrob(p1, p2, nrow=1), height=4, width=8, dpi=150)

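[Figure: plots.png. Left: bar plot of summed miles per gallon for each displacement bin. Right: scatterplot of miles per gallon against displacement.]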
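
This is not part of the answer above, but it is worth noting as an alternative: the binning stat behind `geom_histogram()` understands a `weight` aesthetic, so the per-bin sum can also be drawn without pre-aggregating. A minimal sketch follows, reusing `mtcars` and the same bin edges. The names `p_weighted` and `bar_heights` are only illustrative.

```r
library(ggplot2)

# Same 20 equal-width bins as in the aggregate() example.
disp_bin_edges <- seq(from = 71, to = 472, length.out = 21)

# With `weight = mpg`, each row adds its mpg value (instead of 1) to its bin,
# so the bar height is the sum of mpg within that bin.
p_weighted <- ggplot(mtcars, aes(x = disp, weight = mpg)) +
  geom_histogram(breaks = disp_bin_edges, fill = "aquamarine3", alpha = 0.6) +
  ylab("Sum of miles per gallon") +
  xlab("Displacement")

# Inspect the computed bar heights without drawing the plot.
bar_heights <- ggplot_build(p_weighted)$data[[1]]$count
```

Unlike the `cut()` version, this keeps the x axis continuous, so empty bins show up as gaps rather than as missing factor levels.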

bdemarest