How to extract all numeric and symbols (punctuation marks) from dataframe until whitespace

Question

I have a dataframe containing character columns. The first column (V1) contains IDs and is followed by multiple columns containing strings with numbers, letters and symbols. What i would like is to extract all numeric values and symbols until there is a space in the string. Ideally, i would want all numeric vales and symbols found in column V2, to be written to a new column with a ";" seperator.

   df2 <- structure(list(V1 = c("00094", "00001", "00002", "00003", "00004", 
"00005", "00006", "00007", "00008", "00009"), V2 = c("4-6-2021 (aw), vaccinatie naam en data aangepast 19-8-2021 kv: ic nog niet ontvangen nav eerdere", 
NA, "23-7 mf: t2-vragenlijst omgezet naar t3, verzoek bij alienke om t2 af te keuren 6-12 mf: corona", 
NA, NA, "13-12 mf: 3 maanden na 2e vaccinatie corona", "20-7 mf: vaccinatiedatum blijkt enige vaccinatie 6-12 mf: corona", 
NA, NA, "15-7-2021 kv: corona gehad in maand05 2021, dus één vaccinatie. 6-12 mf: corona"
), V3 = c("eerdere brief. mf/sp telefonisch contact laten opnemen. 19-8 mf: gg, herinneringsmail gestuurd, komt niet a", 
NA, "corona gehad, 1 vaccinatie, per mail", NA, NA, "corona gekregen, per mail", 
"corona gehad, 1 vaccinatie, per mail", NA, NA, NA)), row.names = c(NA, 
10L), class = "data.frame")

This would be the desired output (column names are not important):

df2_new <- structure(list(V1 = c("00094", "00001", "00002", "00003", "00004", 
"00005", "00006", "00007", "00008", "00009"), V2 = c("4-6-2021 (aw), vaccinatie naam en data aangepast 19-8-2021 kv: ic nog niet ontvangen nav eerdere", 
NA, "23-7 mf: t2-vragenlijst omgezet naar t3, verzoek bij alienke om t2 af te keuren 6-12 mf: corona", 
NA, NA, "13-12 mf: 3 maanden na 2e vaccinatie corona", "20-7 mf: vaccinatiedatum blijkt enige vaccinatie 6-12 mf: corona", 
NA, NA, "15-7-2021 kv: corona gehad in maand05 2021, dus één vaccinatie. 6-12 mf: corona"
), V3 = c("eerdere brief. mf/sp telefonisch contact laten opnemen. 19-8 mf: gg, herinneringsmail gestuurd, komt niet a", 
NA, "corona gehad, 1 vaccinatie, per mail", NA, NA, "corona gekregen, per mail", 
"corona gehad, 1 vaccinatie, per mail", NA, NA, NA), `dates V2` = c("4-6-2021;19-8-2021", 
NA, "23-7;6-12", NA, NA, "13-12", "20-7;6-12", NA, NA, "15-7-2021;6-12"
), `dates V3` = c("19-8", NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
-10L), class = "data.frame")

Thanks so much!

score 2 · Answer 1 · edited Mar 29 '23 at 12:00

Here is a base R method. Based on your input and expected output and the fact you're looking for dates, I've taken the liberty of assuming that your specification is:

^[0-9]+   # Starts with at least one digit
[\\-]     # Has at least one hyphen
[0-9-]+$  # Ends with a digit or hyphen

The escaped hyphen doesn't need to be in square brackets but it's easier for it to be there if you want to add other symbols, or even replace it with [:punct:].

First define the function to split on whitespace, search for the pattern, and collapse the matches together:

get_all_numeric_symbols <- function(s, pattern="^[0-9]+[\\-][0-9-]+$") {
  out <- strsplit(s, " ") |>
    vapply(
      \(x) paste(
        grep(pattern, x, value=TRUE),
        collapse=";"
      ),
      character(1)
    )
  out[out == ""] <- NA_character_
  out
}

Then apply it to the desired columns:

cols <- c("V2", "V3")
new_cols <- paste0("dates_", cols)

df2[new_cols] <- lapply(cols, \(col) get_all_numeric_symbols(df2[[col]]))

# Check it matches expected output
identical(df2$dates_V2, df2_new$`dates V2`) # TRUE
identical(df2$dates_V3, df2_new$`dates V3`) # TRUE

Note - I am not checking for all punctuation here with "[:punct:]" because it can return strings which do not seem to be dates, e.g. those with @ or ?. I've used a much more minimal regex that excludes such strings, as well as digits without a hyphen.

You can extend it if you have other characters you wish to include, e.g. if you wanted to also include forward slashes, by adding them within both sets of square brackets. For example, "^[0-9]+[/\\-][/0-9\\-]+$" would match "29/03/2023".

score 1 · Accepted Answer · answered Mar 29 '23 at 10:21

This will result in the desired output:

library(stringr)
library(tidyverse)
regex <- c("(\\d+[:punct:])+\\d+")
df2$V4 <- str_extract_all(df2$V2, regex)
df2$V5 <- str_extract_all(df2$V3, regex)

df2 <- df2 %>% rowwise() %>% mutate(V4=paste(V4, sep="; ", collapse="; "),
                                    V5=paste(V5, sep="; ", collapse="; "))

How to extract all numeric and symbols (punctuation marks) from dataframe until whitespace

2 Answers2