2

Related questions here R/dplyr: How to only keep integers in a data frame? and here R extract first number from string but this attempts to capture a relatively common case in some of the social sciences.

Imagine data as in the example below that are a mixture of text and numerals in one field. Perhaps the result of some survey process. You want to extract all the numbers reliably and ignore the text but if there are multiple numbers, perhaps a range of values you want to take the mean or median or some other numeric function to give you one "best" answer.

library(dplyr)
library(stringr)
library(purrr)
Years <- c("0 to 1 year",
           "5 yrs",
           "About 10 yrs",
           "20 yrs",
           "4 yrs",
           "4-5 yrs",
           "75 to 100 YEARS old",
           ">1 yearsmispelled or whatever")
df <- data.frame(Years)


Chuck P
  • 3,862
  • 3
  • 9
  • 20

2 Answers2

4

Although I know solutions are available using other regular expressions and with nothing more than base tools, I will present a tidyverse solution. Trivial to substitute in many other numeric operators such as median, and this solution will work no matter how many numbers are found.

library(dplyr)
library(stringr)

df <- df %>% 
  mutate(
    NewYears = str_extract_all(Years, "[0-9]+") %>%
      purrr::map( 
        ~ifelse(
           length(.x) == 1, 
           as.numeric(.x), 
           mean(unlist(as.numeric(.x)))
        )
      )
  )

df
#>                           Years NewYears
#> 1                   0 to 1 year      0.5
#> 2                         5 yrs        5
#> 3                  About 10 yrs       10
#> 4                        20 yrs       20
#> 5                         4 yrs        4
#> 6                       4-5 yrs      4.5
#> 7           75 to 100 YEARS old     87.5
#> 8 >1 yearsmispelled or whatever        1
Brian
  • 7,900
  • 1
  • 27
  • 41
Chuck P
  • 3,862
  • 3
  • 9
  • 20
2

You don't need to check for the length of the extracted number because mean/median/sum of 1 number is the same number.

sapply(stringr::str_extract_all(df$Years, '\\d+'),function(x) mean(as.numeric(x)))
#[1]  0.5  5.0 10.0 20.0  4.0  4.5 87.5  1.0

To do this using pipe :

library(dplyr)
library(stringr)
library(purrr)

df %>%
  mutate(NewYear = str_extract_all(Years, '\\d+'), 
         NewYear = map_dbl(NewYear, ~mean(as.numeric(.x))))


#                          Years NewYear
#1                   0 to 1 year     0.5
#2                         5 yrs     5.0
#3                  About 10 yrs    10.0
#4                        20 yrs    20.0
#5                         4 yrs     4.0
#6                       4-5 yrs     4.5
#7           75 to 100 YEARS old    87.5
#8 >1 yearsmispelled or whatever     1.0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you I realize I didn't have to explicitly check length for many operations but wanted to make it explicit in case a different tactic was needed. – Chuck P Jun 04 '20 at 11:54