I have a very large data file (>300k rows) and each row is part of a unique sample (>3000 samples). I want to generate a kernel density estimator for each individual sample and extract relevant information (minimum value, maximum value, maximum probability of density estimator, median of density estimator, mean of density estimator) into a separate table along with the sample name.
I have tried to extract information from the ggplot
function stat_density_ridges()
using the approaches laid out here
Adding a mean to geom_density_ridges and here draw line on geom_density_ridges which extract data from stat_density_ridges
and ggplot_build
with purrr::pluck
but it doesn't provide all the information that I want.
The following generates some synthetic data similar to what I want:
set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number )
And the plot in ggplot
that shows the distribution:
ggplot( data = d, aes( x = x, y = as.factor( samples ) ) ) +
labs( x = expression( paste( "x" ) ),
y = expression( paste( "sample number" ) ) ) +
stat_density_ridges()
I would like to end up with a data table with the following information:
sample.name
, max(x)
, min(x)
, maximum height of the kernel density estimator and it's x
location, median height of the kernel density estimator and it's x
location, etc.
The only thing I can think to do is to create a long and arduous loop
sample.numbers <- rep( NA, times = max( d$sample.number ) )
max.x <- rep( NA, times = max( d$sample.number ) )
min.x <- rep( NA, times = max( d$sample.number ) )
for( i in 1:max( d$sample.number ) ) {
temp.d = d[ d$sample.number == i, ]
sample.numbers[ i ] = i
max.x[ i ] = max( temp.d$x )
min.x[ i ] = min( temp.d$x )
}
and then somehow adding in a bit that creates a density estimator and extracts the information from that. I'm guessing the indexing in R presents an easier way to get through this for the many thousands of samples I have while using group_by
, but I cannot figure it out. Note, I'm still having trouble getting my head around piping in R, so some simple explaining might be required if solutions have that in them.