I have two columns, value and frequency:
value: 1-6, 7-16, 17-21, 22-51, 52-80, 81-110
freq: 300, 400, 300, 1200, 800, 55
How would I construct a data frame of sorts and approximate the mean, median, and mode of this data using a function?
Here's one possible approach using a simulation. You could create fake data that has the desired interval frequencies and look at the result.
Here I've assumed that every value within an interval is equally likely, but you may have domain knowledge that suggests a more realistic distribution within each interval. For instance, if value represented ages, we should expect 81 to be dramatically more likely than 110, even though both fall in the same interval. In that case, you could replace the runif() step below with sample() and specify the probabilities of the different values there. As a quick back-of-the-envelope estimate, though, this approach should get you most of the way there.
First, your summary info as code:
df <- data.frame(value = c("1-6", "7-16", "17-21", "22-51", "52-80", "81-110"),
                 freq = c(300, 400, 300, 1200, 800, 55))
Then we can create data that fits your summary numbers:
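library(dplyr); library(tidyr)
set.seed(0)
df %>%
  # split "1-6" into integer columns min = 1 and max = 6, keeping the label
  separate(value, c("min", "max"), remove = FALSE, convert = TRUE) %>%
  # repeat each row freq times, giving one row per simulated observation
  uncount(freq) %>%
  # draw one uniform value per row from that row's interval
  rowwise() %>%
  mutate(value_random = runif(1, min, max)) %>%
  ungroup()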
# value min max value_random
# <chr> <int> <int> <dbl>
#1 1-6 1 6 4.90
#2 1-6 1 6 3.62
#3 1-6 1 6 2.23
#4 1-6 1 6 3.67
#...
Then you could get the summary stats you're looking for...
... %>% summarize(mean = mean(value_random),
                  median = median(value_random))
# output, will vary depending on the random seed set with "set.seed" above
#    mean median
#   <dbl>  <dbl>
# 1  36.8   33.9
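If uniform draws within each interval feel too crude, the sample() variant mentioned earlier might look roughly like the sketch below. It uses base R only, draws integer values, and lets each interval have its own weighting. The helper draw_interval() and the decay rates are hypothetical, chosen just to illustrate the idea; real weights would come from your domain knowledge.

```r
set.seed(0)

# Same summary info as above, with the interval bounds already split out.
df <- data.frame(min  = c(1, 7, 17, 22, 52, 81),
                 max  = c(6, 16, 21, 51, 80, 110),
                 freq = c(300, 400, 300, 1200, 800, 55))

# Hypothetical helper: draw n integers from min..max.
# decay = 1 makes every value equally likely; decay < 1 makes each
# successive value less likely than the one before it.
draw_interval <- function(min, max, n, decay = 1) {
  values  <- seq(min, max)
  weights <- decay^(seq_along(values) - 1)
  # sample.int() normalises the weights internally
  values[sample.int(length(values), size = n, replace = TRUE, prob = weights)]
}

# Assumed decay rates, one per interval: uniform everywhere except the last,
# where higher values are treated as progressively rarer.
decay <- c(1, 1, 1, 1, 1, 0.85)

value_random <- unlist(Map(draw_interval, df$min, df$max, df$freq, decay))

mean(value_random)
median(value_random)
```

Because only 55 of the 3055 observations fall in the last interval, changing its weighting barely moves the mean and does not move the median; the choice matters more when the skewed interval holds a large share of the data.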
For the mode, see the question "How to find the statistical mode?". Base R has no built-in function for it: mode() returns an object's storage mode, not its most frequent value.
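Since the question asks for a function, here is a sketch of a non-simulation alternative: the usual closed-form estimates for grouped data. This is a different technique from the simulation above, and grouped_stats() is a hypothetical helper, not something from a package. It assumes values are spread evenly within each interval and treats each interval as running from its lower to its upper bound, which matches the runif(1, min, max) step.

```r
# Hypothetical helper: closed-form estimates from binned counts, assuming
# values are spread evenly within each interval.
grouped_stats <- function(value, freq) {
  # Split labels such as "22-51" into numeric lower and upper bounds.
  bounds <- do.call(rbind, strsplit(value, "-", fixed = TRUE))
  lo <- as.numeric(bounds[, 1])
  hi <- as.numeric(bounds[, 2])

  # Mean: weight each interval's midpoint by its frequency.
  mean_est <- sum((lo + hi) / 2 * freq) / sum(freq)

  # Median: find the interval where the cumulative count reaches half the
  # total, then interpolate linearly inside it.
  half  <- sum(freq) / 2
  cum   <- cumsum(freq)
  i     <- which(cum >= half)[1]
  below <- if (i > 1) cum[i - 1] else 0
  median_est <- lo[i] + (half - below) / freq[i] * (hi[i] - lo[i])

  # Modal interval: the intervals have unequal widths, so compare density
  # (count per unit of width) rather than raw counts.
  density <- freq / (hi - lo)
  modal_interval <- value[which.max(density)]

  list(mean = mean_est, median = median_est, modal_interval = modal_interval)
}

stats <- grouped_stats(
  value = c("1-6", "7-16", "17-21", "22-51", "52-80", "81-110"),
  freq  = c(300, 400, 300, 1200, 800, 55)
)
print(stats)
```

On this data the function gives a mean of about 37.1 and a median of about 34.7, in line with the simulated 36.8 and 33.9. It also reports 17-21 as the densest interval, even though 22-51 has the largest raw count, which is worth keeping in mind when the intervals are of such different widths.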