How to get midpoint from range in string within a dataframe column in R?

Question

I have a data frame which looks like this.

How can I use R code to create a new column in the data frame that holds the midpoint of each age group (such as 34.5 for "30 to 39 Years")?

It would be easier to help if you created a small reproducible example (dataset) along with your expected output, so everyone can test their ideas and see which one might be an answer. To do that, add your dataset as a code chunk, not as an image. Here is some information about reproducible examples: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – tamtam, Mar 10 '21 at 14:38

score 0 · Answer 1 · answered Mar 10 '21 at 14:35

Here is an idea. First extract the minimum and maximum age of each age group, then take the mean of the two.

Code

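library(dplyr)    # rowwise(), mutate(), select(), %>%
library(stringr)  # str_extract_all()

df <- df %>%
  rowwise() %>%
  # the first number in the label is the lower bound, the second the upper bound
  mutate(min = as.numeric(unlist(str_extract_all(`Age Group`, "\\d+"))[1]),
         max = as.numeric(unlist(str_extract_all(`Age Group`, "\\d+"))[2]),
         midpoint = mean(c(min, max))) %>%
  ungroup() %>%
  select(-min, -max)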

# A tibble: 3 x 3
  `Outbreak Associated` `Age Group`      midpoint
  <fct>                 <fct>               <dbl>
1 Sporadic              40 to 49 Years       44.5
2 Sporadic              140 to 149 Years    144. 
3 Sporadic              20 to 29 Years       24.5

Data

df <- data.frame(`Outbreak Associated` = "Sporadic",
                 `Age Group` = c("40 to 49 Years",
                                 "140 to 149 Years",
                                 "20 to 29 Years"),
                 check.names = FALSE)

score 0 · Answer 2 · answered Mar 10 '21 at 14:39

Data

df <- data.frame(AgeGroup = c("30 to 39 Years", "40 to 49 Years", "50 to 59 Years", "100 to 109 Years"))

You can do it in one very long line.

df$AgeMean <- rowMeans(matrix(as.numeric(unlist(stringr::str_extract_all(df$AgeGroup, pattern = "[:digit:]{1,3}"))), ncol = 2, byrow = TRUE))

So let me break it up:

stringr::str_extract_all finds and extracts every run of 1 to 3 digits
stringr::str_extract_all returns a list of character vectors, one per row, each holding the two numbers found in that row. unlist converts the list to a single character vector, and as.numeric converts character to numeric
matrix packs the numeric vector into a two-column matrix, one row per age group
rowMeans gives us the mean of each row, which is the midpoint of the age range
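
The steps above can also be written out one at a time, which makes each intermediate result easy to inspect. This is only a sketch of the same one-liner split into pieces; the intermediate names (digits, bounds, bounds_matrix) are introduced here for illustration and are not part of the original answer.

```r
library(stringr)

# Sample data, same values as the answer's example
df <- data.frame(AgeGroup = c("30 to 39 Years", "40 to 49 Years",
                              "50 to 59 Years", "100 to 109 Years"))

# Step 1: extract every run of 1 to 3 digits.
# The result is a list with one character vector per row.
digits <- str_extract_all(df$AgeGroup, pattern = "[:digit:]{1,3}")

# Step 2: flatten the list and convert the strings to numbers
bounds <- as.numeric(unlist(digits))

# Step 3: put each lower/upper pair on its own row
bounds_matrix <- matrix(bounds, ncol = 2, byrow = TRUE)

# Step 4: average each row to get the midpoint
df$AgeMean <- rowMeans(bounds_matrix)

print(df)
```

Note that this relies on every label containing exactly two numbers. A label such as "80+ Years" would yield a single number, and the pairs in the matrix would then be misaligned for all following rows.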

Thank you so much! Such an elegant solution. – abrarnasir, Mar 25 '21 at 12:44

score 0 · Answer 3 · edited Mar 10 '21 at 15:39

You will hear that it is helpful to include a reproducible example when asking questions. I put that together here for you as an example (and for any others who wander in here) based on the image you provided.

##### Load General Packages                                   # install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)                                            # Loads the core tidyverse packages (dplyr, tidyr, ggplot2, readr, purrr, tibble, stringr, forcats)

# Use this environment option if you want your variables to be type 'character' versus type 'factor'
options(stringsAsFactors = FALSE)                             # Suppresses automatic factor creation for strings (already the default since R 4.0.0)


df <- data.frame(
  outbreak_associated = c("Sporadic", "Sporadic", "Sporadic"),
  age_group = c("40 to 49 Years", "40 to 49 Years", "20 to 29 Years")
)

Once you have this, you can do some simple arithmetic to compute the midpoint:

df_fnl <- df %>%
  # create numeric columns for the min and max of the age range for each age group
  separate(col = age_group, into = c("age_min", "age_max"), sep = " to ", remove = FALSE) %>%
  mutate(age_min = as.numeric(age_min)) %>%
  mutate(age_max = as.numeric(gsub(" Years", "", age_max))) %>%
  # calculate the midpoint of the range
  mutate(age_midpoint = age_min + ((age_max - age_min) / 2))

There might be more efficient options, though this will get the job done.
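
As one possible shorter variant, here is a sketch that uses only base R and no extra packages. It assumes every label has the exact form "<min> to <max> Years"; the names pattern and age_midpoint are chosen here for illustration, and the sample data includes a three-digit range to show that the regular expression handles it.

```r
# Sample data in the same shape as the example above
df <- data.frame(
  outbreak_associated = c("Sporadic", "Sporadic", "Sporadic"),
  age_group = c("40 to 49 Years", "140 to 149 Years", "20 to 29 Years")
)

# Capture the lower bound in group 1 and the upper bound in group 2
pattern <- "^(\\d+) to (\\d+) Years$"
age_min <- as.numeric(sub(pattern, "\\1", df$age_group))
age_max <- as.numeric(sub(pattern, "\\2", df$age_group))

# Midpoint of each range, computed for the whole column at once
df$age_midpoint <- (age_min + age_max) / 2

print(df)
```

Labels that do not match the pattern are left unchanged by sub(), so as.numeric() returns NA for them with a warning, which makes unexpected formats easy to spot.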
