I have a data frame which looks like this.
How can I use R code to create a new column in the data frame that holds the midpoint of each age group (such as 34.5 for "30 to 39 Years")?
Here is an idea: first extract the minimum and maximum age of each age group, then take the mean of the two.
Code
library(dplyr)
library(stringr)

df <- df %>%
  rowwise() %>%
  # pull the first and second number out of each age group label
  mutate(min = as.numeric(unlist(str_extract_all(`Age Group`, "\\d+"))[1]),
         max = as.numeric(unlist(str_extract_all(`Age Group`, "\\d+"))[2]),
         midpoint = mean(c(min, max))) %>%
  ungroup() %>%
  select(-min, -max)
# A tibble: 3 x 3
`Outbreak Associated` `Age Group` midpoint
<fct> <fct> <dbl>
1 Sporadic 40 to 49 Years 44.5
2 Sporadic 140 to 149 Years 144.
3 Sporadic 20 to 29 Years 24.5
Data
df <- data.frame(`Outbreak Associated` = "Sporadic",
                 `Age Group` = c("40 to 49 Years",
                                 "140 to 149 Years",
                                 "20 to 29 Years"),
                 check.names = FALSE)
Data
df <- data.frame(AgeGroup = c("30 to 39 Years", "40 to 49 Years", "50 to 59 Years", "100 to 109 Years"))
You can do it in one very long line.
df$AgeMean <- rowMeans(matrix(as.numeric(unlist(stringr::str_extract_all(df$AgeGroup, pattern = "[:digit:]{1,3}"))), ncol = 2, byrow = TRUE))
So let me break it up:
stringr::str_extract_all finds and extracts all groups of numbers 1 to 3 digits long. It returns a list of two-component character vectors.
unlist converts that list to a single character vector, and as.numeric converts character to numeric.
matrix packs the numeric vector into a two-column matrix.
rowMeans gets us the mean age.
You will hear that it is helpful to include a reproducible example when asking questions. I put one together here for you (and for any others who wander in here), based on the image you provided.
##### Load General Packages
# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse) # Loads the family of packages in the tidyverse (dplyr, tidyr, ggplot2, readr, purrr, tibble, stringr, forcats)
# Use this environment option if you want your variables to be type 'character' rather than type 'factor'
options(stringsAsFactors = FALSE) # This option suppresses automatic factor creation on string imports
df <- data.frame(
outbreak_associated = c("Sporadic", "Sporadic", "Sporadic"),
age_group = c("40 to 49 Years", "40 to 49 Years", "20 to 29 Years")
)
Once you have this, you could do some simple math to compute the midpoint:
df_fnl <- df %>%
  # create numeric columns for the min and max of the age range for each age group
  separate(col = age_group, into = c("age_min", "age_max"), sep = " to ", remove = FALSE) %>%
  mutate(age_min = as.numeric(age_min)) %>%
  mutate(age_max = as.numeric(gsub(" Years", "", age_max))) %>%
  # calculate the midpoint of the age range
  mutate(age_midpoint = age_min + ((age_max - age_min) / 2))
There might be more efficient options, though this will get the job done.
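One possibly more compact option is a base R version that needs no extra packages. This is only a sketch: it assumes every label contains exactly two numbers (as in "40 to 49 Years"), and the names bounds and age_midpoint are made up for illustration.

```r
# Example data mirroring the age_group column above (assumed label format)
df <- data.frame(
  age_group = c("40 to 49 Years", "40 to 49 Years", "20 to 29 Years"),
  stringsAsFactors = FALSE
)

# Extract every run of digits from each label: one character vector per row
bounds <- regmatches(df$age_group, gregexpr("[0-9]+", df$age_group))

# Average the lower and upper bound of each age group
df$age_midpoint <- vapply(bounds, function(x) mean(as.numeric(x)), numeric(1))

df
```

Because the digits are matched by pattern instead of by splitting on " to ", this should also cope with three-digit ages such as "100 to 109 Years". A label like "80+ Years" would give 80 instead of a true midpoint, so check for open-ended groups first.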