
I have two columns, value and frequency. The value intervals are 1-6, 7-16, 17-21, 22-51, 52-80, and 81-110, with frequencies 300, 400, 300, 1200, 800, and 55.

How would I construct a data frame from this and approximate the mean, median, and mode of the data using a function?

phen
  • I don't think a generalized approach exists, since the result will depend on your assumptions. For instance, is a value in the 81-110 interval equally likely to be any of those values? If this were, say, ages, that would be a bad assumption. – Jon Spring Sep 17 '21 at 00:40
  • Please provide enough code so others can better understand or reproduce the problem. – Community Sep 18 '21 at 02:00

1 Answer


Here's one possible approach using a simulation. You could create fake data that has the desired interval frequencies and look at the result.

Here, I've assumed that every value within an interval is equally likely, but in reality you might have domain knowledge that points to a more realistic distribution within each interval. For instance, if value represented ages, we should expect "81" to be dramatically more likely than "110", even though both are in the same interval. In that case, you might replace the runif step below with sample and specify the probabilities of the different values there. As a quick back-of-the-envelope estimate, though, this approach should get you most of the way there.

First, your summary info as code:

df <- data.frame(value = c("1-6", "7-16", "17-21", "22-51", "52-80","81-110"),
           freq = c(300,400,300,1200,800,55))

Then we can create data that fits your summary numbers:

library(dplyr); library(tidyr)
set.seed(0)
df %>% 
  separate(value, c("min", "max"), remove = FALSE, convert = TRUE) %>%
  uncount(freq) %>%
  rowwise() %>%
  mutate(value_random = runif(1, min, max)) %>%
  ungroup()

 #  value   min   max value_random
 #  <chr> <int> <int>        <dbl>
 #1 1-6       1     6         4.90
 #2 1-6       1     6         3.62
 #3 1-6       1     6         2.23
 #4 1-6       1     6         3.67
 #...

Then you could get the summary stats you're looking for...

... %>% summarize(mean = mean(value_random),
                  median = median(value_random))

# output, will vary depending on the random seed set with "set.seed" above
   mean median
  <dbl>  <dbl>
1  36.8   33.9
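
If uniform draws within an interval are not plausible, the runif step could be swapped for weighted sampling. The following is only a minimal sketch for the 81-110 interval: the helper draw_interval and the geometric decay weights are hypothetical choices made for illustration, not something derived from the data.

```r
# Hypothetical helper: draw n integer values from lo..hi, where each
# successive value is `decay` times as likely as the previous one.
# The geometric weighting is an assumption made purely for illustration.
draw_interval <- function(n, lo, hi, decay = 0.9) {
  values  <- lo:hi
  weights <- decay^(seq_along(values) - 1)
  # sample.int() rescales `prob` internally, so the weights need not sum to 1
  values[sample.int(length(values), size = n, replace = TRUE, prob = weights)]
}

set.seed(0)
# 55 observations fall in the 81-110 interval in the question's data
ages_81_110 <- draw_interval(55, 81, 110)
range(ages_81_110)
mean(ages_81_110)
```

With weights like these, the simulated values cluster near the lower end of the interval instead of around its midpoint.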

For the mode, see the question "How to find the statistical mode?". Base R has no built-in function for it, so it is not straightforward.
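
For grouped data like this, one rough alternative is to report the modal interval rather than a single modal value. Because the intervals have unequal widths, the sketch below compares frequency per unit of width instead of raw frequency. It assumes integer-valued data (so the width is max - min + 1), and the column names width and density are introduced here only for illustration.

```r
df <- data.frame(value = c("1-6", "7-16", "17-21", "22-51", "52-80", "81-110"),
                 freq  = c(300, 400, 300, 1200, 800, 55))

# Split "min-max" labels into numeric bounds
bounds <- do.call(rbind, strsplit(as.character(df$value), "-", fixed = TRUE))
df$min <- as.integer(bounds[, 1])
df$max <- as.integer(bounds[, 2])

# Assumption: values are integers, so an interval covers max - min + 1 values
df$width   <- df$max - df$min + 1
df$density <- df$freq / df$width

# Interval with the highest frequency per unit of width
modal_class <- df$value[which.max(df$density)]
modal_class
```

Under these assumptions the modal interval is 17-21, even though 22-51 has the largest raw frequency, because 22-51 spreads its 1200 observations over a much wider range.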

Jon Spring
  • Won't drawing from a uniform distribution between a range always end up converging on the midpoint of the range? – thelatemail Sep 17 '21 at 01:17
  • Within each interval, yes. But I understand the point of the OP was to get to a plausible understanding of the mean and median of the whole distribution, based on the known interval frequencies. If there is more domain knowledge or prior info that would guide a more plausible distribution of values within intervals, that could be used for a more accurate result. – Jon Spring Sep 17 '21 at 03:40
  • @thelatemail, For the purpose of estimating the mean, you could just take the weighted average of the midpoints of the intervals. Trickier for estimating the median, I think. – Jon Spring Sep 17 '21 at 05:57
  • Indeed. That was kind of what I was alluding to. It would be stable then too. The median could be estimated as the point n% of the way above the minimum of the middle range, or something like that. – thelatemail Sep 17 '21 at 06:09
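
The two deterministic estimates discussed in the comments can be sketched as follows. This assumes, like the simulation in the answer, that values are spread uniformly between each interval's stated minimum and maximum. The variable names are illustrative.

```r
lo   <- c(1, 7, 17, 22, 52, 81)
hi   <- c(6, 16, 21, 51, 80, 110)
freq <- c(300, 400, 300, 1200, 800, 55)

# Mean: frequency-weighted average of the interval midpoints
mid <- (lo + hi) / 2
grouped_mean <- sum(mid * freq) / sum(freq)  # same as weighted.mean(mid, freq)

# Median: find the interval containing the middle observation, then
# interpolate linearly within it
cum_freq <- cumsum(freq)
half     <- sum(freq) / 2
k        <- which(cum_freq >= half)[1]            # index of the median interval
below    <- if (k > 1) cum_freq[k - 1] else 0     # observations before it
grouped_median <- lo[k] + (half - below) / freq[k] * (hi[k] - lo[k])

grouped_mean
grouped_median
```

For these data the estimates come out at roughly 37.05 for the mean and 34.75 for the median, close to the simulated values in the answer but without depending on a random seed.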