3

I have a data frame that has years in it (data type chr):

Years:
5 yrs
10 yrs
20 yrs
4 yrs

I want to keep only the integers to get a data frame like this (data type num):

Years:
5
10
20
4

How do I do this in R?

Gregor Thomas
questionmark
  • 1
    See a bunch of possibilities at [How to extract the first number from a string?](https://stackoverflow.com/q/23323321/903061) – Gregor Thomas Jun 03 '20 at 16:31
  • A great reference @Gregor but the question and requirement weren't exactly aligned with what OP wanted. Submitted Title edit. – Chuck P Jun 03 '20 at 18:28
  • 1
    @ChuckP I'd encourage OP to edit the question changing the requirements (or better, ask a new question with the new requirements) rather than changing the title to not match the content of the question. – Gregor Thomas Jun 03 '20 at 18:34
  • No argument from me, @Gregor, on letting OP do the edit. I'm just not sure they're still around, and I'm eager to have the submitted answers show up in searches for future askers. – Chuck P Jun 03 '20 at 18:48
  • 1
    @ChuckP You could ask and answer your own question. I see that OP commented on the answer with that use-case, but looking more holistically, OP asked a pretty clear question, got 2 good answers. Editing the title now to not match the question content is terrible. Editing the title and question to not match one of the answers is bad (especially since it's the first and top answer, currently) - it would seem like sabotaging that answer. – Gregor Thomas Jun 03 '20 at 18:56
  • 1
    If you want to provide a resource for future users that's somewhat related to this question, ask a new question - make it a nice one with copy/pasteable sample data, and submit your answer. – Gregor Thomas Jun 03 '20 at 18:56

3 Answers

3

You need to strip the " yrs" suffix and convert what remains to numeric (using the `Years` column from the question):

df$Years <- as.numeric(sub(" yrs", "", df$Years))
Daniel O
  • Thanks Daniel! How would I deal with entries like "yrs." for >1 years and "yr." for exactly 1 year? Can I account for that at the same time? – questionmark Jun 03 '20 at 16:25
  • 2
    with `" yrs.| yr."` as your pattern in the `sub` function – Daniel O Jun 03 '20 at 16:28
  • Thank you again, you're a legend! Do you know how I could take care of a range of values, i.e. 4-5 yrs.? Say, I wanted to take the average of the two? – questionmark Jun 03 '20 at 16:35
  • 1
    Your solution requirement keeps changing. It's important to give us all the requirements up front... I'll post another answer that approaches this differently now that you've added this new wrinkle but please tell us all the wrinkles up front. – Chuck P Jun 03 '20 at 18:10
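
As a minimal sketch of the suffix handling discussed in these comments: in the pattern `" yrs.| yr."` the unescaped `.` matches any character, not just a literal dot. The variant below escapes the dot and makes the space, the trailing `s`, and the dot optional. The vector `years_chr` is hypothetical sample data, not from the question.

```r
# Hypothetical sample data covering the suffix variants mentioned in the comments
years_chr <- c("5 yrs", "10 yrs.", "1 yr.", "4 yrs")

# " ?yrs?\\.?$" matches an optional space, "yr" or "yrs",
# and an optional literal dot at the end of the string
years_num <- as.numeric(sub(" ?yrs?\\.?$", "", years_chr))
years_num
```

Anything left over that is not a plain number (for example a range such as "4-5") would still become `NA` with a coercion warning, so this only covers the single-value case.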
1

Per your additional requirements, here is a more general-purpose solution, though it has limits too. The nice thing about the more complicated `Years3` approach is that it deals more gracefully with unexpected but quite possible inputs.

library(dplyr)
library(stringr)
library(purrr)

Years <- c("5 yrs",
           "10 yrs",
           "20 yrs",
           "4 yrs",
           "4-5 yrs",
           "75 to 100 YEARS old",
           ">1 yearsmispelled or whatever")
df <- data.frame(Years)

# just the numbers but loses the -5 in 4-5
df$Years1 <- as.numeric(sub("(\\d{1,4}).*", "\\1", df$Years)) 
#> Warning: NAs introduced by coercion

# just the numbers but loses the -5 in 4-5 using str_extract
df$Years2 <- str_extract(df$Years, "[0-9]+")

# a lot more needed to account for averaging

df$Years3 <- str_extract_all(df$Years, "[0-9]+") %>%
  purrr::map( ~ ifelse(length(.x) == 1, 
                as.numeric(.x), 
                mean(unlist(as.numeric(.x)))))

df
#>                           Years Years1 Years2 Years3
#> 1                         5 yrs      5      5      5
#> 2                        10 yrs     10     10     10
#> 3                        20 yrs     20     20     20
#> 4                         4 yrs      4      4      4
#> 5                       4-5 yrs      4      4    4.5
#> 6           75 to 100 YEARS old     75     75   87.5
#> 7 >1 yearsmispelled or whatever     NA      1      1
Chuck P
  • If you do take this to a new question, `avg2nums` looks weird to me. The `Reduce` will make it perform a weird calculation on input > 2. I think `function(x) mean(unlist(x))` should be better - and is simple enough you could probably use it anonymously inside the `map` call. – Gregor Thomas Jun 03 '20 at 18:58
  • Great call, I edited away the `Reduce`; I don't know what I was thinking. – Chuck P Jun 03 '20 at 19:48
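
Following the suggestion above to simply take the mean of whatever numbers are found, here is a hedged base R sketch of the same averaging idea. It swaps `str_extract_all` and `purrr::map` for `regmatches`, `gregexpr`, and `vapply`, so the result is a plain numeric vector rather than the list that `purrr::map` returns. The names `digit_runs` and `years_avg` are illustrative only.

```r
# A few of the inputs from the answer above
Years <- c("5 yrs", "4-5 yrs", "75 to 100 YEARS old")

# Pull every run of digits out of each string (a list of character vectors)
digit_runs <- regmatches(Years, gregexpr("[0-9]+", Years))

# Average the numbers found in each element; vapply returns a numeric vector
years_avg <- vapply(digit_runs, function(x) mean(as.numeric(x)), numeric(1))
years_avg
```

A single number is its own mean, so no special case is needed for entries without a range. A string containing no digits at all would yield `NaN`, which may need separate handling.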
1

Base R solution:

clean_years <- as.numeric(gsub("\\D", "", Years))

Data:

Years <- c("5 yrs",
           "10 yrs",
           "20 yrs",
           "4 yrs",
           "5 yrs")
hello_friend
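
One caveat worth illustrating for the `gsub("\\D", "", ...)` approach: it removes every non-digit character, so it fits the single-integer strings in the question but glues together the digits of ranges or decimals. A small sketch, where `mixed` is hypothetical data not taken from the question:

```r
# Hypothetical inputs: a plain value, a range, and a decimal
mixed <- c("5 yrs", "4-5 yrs", "2.5 yrs")

# Deleting all non-digits concatenates whatever digits remain in each string
stripped <- as.numeric(gsub("\\D", "", mixed))
stripped
```

Here "4-5 yrs" becomes 45 and "2.5 yrs" becomes 25, so data containing ranges or decimals would need an extraction-based approach instead.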