How to compare datasets with ggplot2 geom_density()

Question

This is an extension of a question I asked previously:
How to extract the density value from ggplot in r

This fruit dataset is in fact the data for country A, and I now have another dataset for country B. I would like to compare their values. However, the density plots (the y-axis) for the fruit Apple differ between country A and country B: country A has its highest density at around 0.8, and country B at around 0.4.

Example for country A (plot image not included):

Q. Country B has a similar curve, but its highest density value on the y-axis is only 0.4. So how can I compare them?

Code for minimal example:

library(ggplot2) 
set.seed(1234) 
df = data.frame(
    fruits = factor(rep(c("Orange", "Apple", "Pears", "Banana"), each = 200)),
    weight = round(c(rnorm(200, mean = 55, sd=5),
                     rnorm(200, mean=65, sd=5),
                     rnorm(200, mean=70, sd=5),
                     rnorm(200, mean=75, sd=5)))
) 

dim(df) # [1] 800   2

# Build the faceted density plot once, store it, and print it
g = ggplot(df, aes(x = weight)) +
  geom_density() +
  facet_grid(fruits ~ ., scales = "free", space = "free")
g

# Extract the computed density values, one data frame per facet panel
p = ggplot_build(g)
sp = split(p$data[[1]][c("x", "density")], p$data[[1]]$PANEL)
apple_df = sp[[1]]  # panel 1 is "Apple" (panels follow the alphabetical factor levels)

sum(apple_df$density) # this is equal to 10.43877, but I want it to be 1

Be aware that density curves are not necessarily a "true" representation of your underlying data, because the estimate depends on many choices, e.g. the bandwidth and the kernel used for estimation (your curves change when those change!). Other curves, such as cumulative probability curves or quantile-quantile curves, represent the underlying data more directly (they do not depend on such tuning choices) and thus might be better for comparison — tjebo, Oct 28 '21 at 11:34
@tjebo Hello, I am getting confused about the density plot. For example, given the apple density plot shown above, when I integrated over the whole curve with `integrate.xy(apple_df$x, apple_df$density)` I only got ~0.9488308. Is that due to the estimation that you mentioned? — Math Avengers, Oct 29 '21 at 03:41
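
The cumulative curves mentioned in the first comment can be sketched with ggplot2's `stat_ecdf()`, which plots the empirical cumulative distribution and needs no bandwidth or kernel. The data frames `country_a` and `country_b` below are simulated stand-ins (assumptions for illustration), not the asker's data.

```r
library(ggplot2)

set.seed(1234)
# Hypothetical stand-in data for one fruit in two countries
country_a <- data.frame(weight = rnorm(200, mean = 65, sd = 5), country = "A")
country_b <- data.frame(weight = rnorm(200, mean = 65, sd = 10), country = "B")
both <- rbind(country_a, country_b)

# Each step curve rises from 0 to 1, so both countries share the same y-axis
ecdf_plot <- ggplot(both, aes(x = weight, colour = country)) +
  stat_ecdf(geom = "step") +
  labs(y = "cumulative probability")
ecdf_plot
```

A steeper step curve indicates a narrower spread, which is the same information that shows up as a taller peak in the density plot.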

Leonardo · Answer 1 · 2021-10-27T12:45:09.647

Suppose you have two data frames for two different countries, df_c1 and df_c2. The idea is to stack the two data frames into one and add a column that identifies the country, so that both can be drawn in the same plot.

library(dplyr)
library(ggplot2)

df_c1 = data.frame(
  fruits = factor(rep(c("Orange", "Apple", "Pears", "Banana"), each = 200)),   
  weight = round(c(rnorm(200, mean = 55, sd=5),
                   rnorm(200, mean=65, sd=5), 
                   rnorm(200, mean=70, sd=5), 
                   rnorm(200, mean=75, sd=5)))
)

df_c2 = data.frame(
  fruits = factor(rep(c("Orange", "Apple", "Pears", "Banana"), each = 200)),   
  weight = round(c(rnorm(200, mean = 20, sd=3),
                   rnorm(200, mean=35, sd=6), 
                   rnorm(200, mean=40, sd=2), 
                   rnorm(200, mean=15, sd=4)))
)


df <- rbind(
  df_c1 %>% mutate(country = "country 1"), 
  df_c2 %>% mutate(country = "country 2")
)


df %>% 
  ggplot() + 
  geom_density(aes(x = weight, color = country)) +
  facet_grid(fruits ~ ., scales = "free", space = "free")
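
If the goal is to compare the shapes of curves whose peaks differ (0.8 versus 0.4 in the question), one option is to plot the `scaled` variable computed by `stat_density()`, which rescales each curve to a maximum of 1. This is only a sketch with simulated data (the `sim` data frame is an assumption), and it assumes ggplot2 3.3.0 or later for `after_stat()`. Note that the y-axis is then no longer a probability density.

```r
library(ggplot2)

set.seed(1234)
# Hypothetical data: same mean, different spread, so the raw peaks differ
sim <- data.frame(
  weight  = c(rnorm(200, mean = 65, sd = 5), rnorm(200, mean = 65, sd = 10)),
  country = rep(c("country 1", "country 2"), each = 200)
)

# after_stat(scaled) divides each group's density by its own maximum
scaled_plot <- ggplot(sim, aes(x = weight, colour = country)) +
  geom_density(aes(y = after_stat(scaled)))
scaled_plot
```

The raw peaks differ here because a wider distribution has to be lower for its area to stay at 1; rescaling removes that height difference and leaves only position and shape.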

Area under curve

Another possibility for working with distributions is to compute the density estimates yourself with the density() function first, and then plot those values.

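# reframe() needs dplyr >= 1.1.0; it replaces summarise() for results
# with several rows per group (older dplyr versions used summarise() here)
dens1 <- df_c1 %>% 
  group_by(fruits) %>% 
  reframe(x = density(weight)$x, y = density(weight)$y) %>% 
  mutate(country = "country 1")

dens2 <- df_c2 %>% 
  group_by(fruits) %>% 
  reframe(x = density(weight)$x, y = density(weight)$y) %>% 
  mutate(country = "country 2")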

df_dens <- rbind(dens1, dens2)

Now we plot the precomputed values in ggplot with geom_line:

df_dens %>% 
  ggplot() +
  geom_line(aes(x, y, color = country)) + 
  facet_grid(fruits ~ ., scales = "free", space = "free")
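
To compare two curves value by value, it can help to evaluate both on the same x grid. The sketch below passes a shared `from`/`to` to `density()` and computes an overlap coefficient (the area under the pointwise minimum of the two curves). The simulated vectors `a` and `b`, the margin of 10, and the choice of overlap as a summary are assumptions for illustration.

```r
set.seed(1234)
a <- rnorm(200, mean = 65, sd = 5)   # hypothetical weights, country 1
b <- rnorm(200, mean = 60, sd = 10)  # hypothetical weights, country 2

# Shared grid, wide enough to hold both samples plus some tail
lo <- min(a, b) - 10
hi <- max(a, b) + 10
da <- density(a, from = lo, to = hi, n = 512)
db <- density(b, from = lo, to = hi, n = 512)

# Both estimates now use identical x values, so the y values line up
dx <- da$x[2] - da$x[1]
overlap <- sum(pmin(da$y, db$y)) * dx  # near 1: similar curves, near 0: little overlap
overlap
```

Because both curves integrate to roughly 1, the overlap is comparable across fruits even when the peak heights are very different.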

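If you want to measure the area under the curve, approximate the integral with a Riemann sum: density() returns equally spaced x values, so the spacing dx is constant and the area is approximately sum(y) * dx.

We choose only one curve, for example country == "country 1" and fruits == "Apple".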

df_single_curve <- df_dens %>% 
  filter(country == "country 1" & fruits == "Apple")

# differential
xx <- df_single_curve$x
dx <- xx[2L] - xx[1L]
yy <- df_single_curve$y

# integral
I <- sum(yy) * dx
I
# [1] 1.000965
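
As a cross-check on the Riemann sum above, the same area can be approximated with the trapezoidal rule. This sketch uses base R only and simulated data (the vector `w` is an assumption) rather than the answer's `df_dens`. It also shows that evaluating the density only over the observed data range gives an area below 1.

```r
set.seed(1234)
w <- rnorm(200, mean = 65, sd = 5)  # hypothetical apple weights
d <- density(w)                     # default grid extends 3 bandwidths past the data

# Trapezoidal rule: average of neighbouring heights times the grid spacing
trapezoid <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area_full <- trapezoid(d$x, d$y)
area_full  # close to 1

# Restricting the grid to the observed range cuts off the tails
d_cut <- density(w, from = min(w), to = max(w))
area_cut <- trapezoid(d_cut$x, d_cut$y)
area_cut   # smaller than area_full
```

An evaluation range that cuts off the tails is one possible reason for an integral that comes out noticeably below 1. The raw sum of the y values is never expected to be 1, because it leaves out the grid spacing.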
