Removing dates and all junks from texts using R

Question

I am cleaning a huge dataset made up of tens of thousands of texts using R. I know regular expressions would do the job conveniently, but I am not good at using them. I have combed Stack Overflow but could not find a solution. This is my dummy data:

foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", 
              "04/02/2016 Health is a priority: WAI000553",
              "09/ 08/2016 Economy is bad: 2031CE8D", 
              ": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")

I want to remove all the dates, punctuation, and IDs, so that my result is this:

[1] "Education is good"        
[2] "Health is a priority"     
[3] "Economy is bad"           
[4] "Vehicle license is needed"

Any help in R will be appreciated.

Do any of the offered answers resolve your issue, William? — r2evans, Apr 26 '21 at 21:36

score 1 · Accepted Answer · answered Apr 22 '21 at 17:54

I think specificity is in order here:

First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two can be 1-2 digits, and the third is always 4 digits. If this is variable, the regex can be changed to be a little more permissive:

foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982"        " Health is a priority: WO0002021"     " Economy is bad: WO001999"            " Vehicle license is needed: WO001050"
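
A more permissive variant might accept a two-digit year as well. The following is only a sketch under that assumption; the names `x`, `permissive_date`, and `no_dates` are illustrative, and the two rows are taken from the question's data:

```r
# Two rows from the question's data: a 4-digit and a 2-digit year
x <- c("03 / 05 / 2016 Education is good: WO0001982",
       ": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")

# Year is two digits, optionally followed by two more
permissive_date <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{2}(\\d{2})?"

no_dates <- gsub(permissive_date, "", x)
no_dates
# [1] " Education is good: WO0001982"
# [2] ": : : Vehicle license is needed: DPH2790 "
```

The trade-off is that a looser pattern is more likely to catch digit groups that are not dates, so it is worth checking against a sample of the real data.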

From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified if the abbreviation is hard-coded to be anything after a colon, numbers prepended with "WO", or just some one-word combination of letters and numbers. Those could be:

gsub(":.*", "", foo_data2)
# [1] " Education is good"         " Health is a priority"      " Economy is bad"            " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: "         " Health is a priority: "      " Economy is bad: "            " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: "         " Health is a priority: "      " Economy is bad: "            " Vehicle license is needed: "

Removing the : should be straightforward, and trimws(.) will remove the leading/trailing spaces.
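
As a minimal sketch of that last step, assuming the dates and IDs are already gone (the vector `stripped` is a hypothetical stand-in for the output of the previous calls):

```r
# Hypothetical intermediate result: dates and IDs already removed
stripped <- c(" Education is good: ", " Health is a priority: ")

# Drop every remaining colon, then trim leading/trailing whitespace
cleaned <- trimws(gsub(":", "", stripped, fixed = TRUE))
cleaned
# [1] "Education is good"    "Health is a priority"
```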

This can obviously be combined into a single regex (using the logical | with pattern grouping) or a single R call (nested gsub) without complication; I kept the steps separate for discussion.
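
For illustration, both combinations might look like the sketch below. It assumes the "everything after the colon" interpretation of the ID and uses two rows of the "WO"-style sample data; `date_re`, `nested`, and `single` are names chosen here, not part of the original code:

```r
# Sample rows with "WO"-prefixed IDs
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WO0002021")

date_re <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}"

# Option 1: nested calls, date first, then everything from the colon onward
nested <- trimws(gsub(":.*", "", gsub(date_re, "", foo_data)))

# Option 2: one call, with the two alternatives grouped and joined by |
single <- trimws(gsub(paste0("(", date_re, ")|(:.*)"), "", foo_data))

identical(nested, single)
# [1] TRUE
```

Both return "Education is good" and "Health is a priority" for these rows. The nested form is usually easier to debug, since each pattern can be tested on its own.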

I think https://stackoverflow.com/a/22944075/3358272 is a good reference for regex in general. Note that while that page shows many regex constructs with single backslashes, R requires double backslashes for all of them (e.g., \d in regex needs to be \\d in R). The exception is the raw-string syntax introduced in R 4.0, with which these two are identical:

"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
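
A quick way to confirm the equivalence, as a sketch that needs R 4.0.0 or later (the names `escaped` and `raw_form` are illustrative):

```r
# Requires R >= 4.0.0 for the r"(...)" raw-string syntax
escaped  <- "\\b[A-Za-z]+\\d+\\b"
raw_form <- r"(\b[A-Za-z]+\d+\b)"

# Both literals produce the same character string
identical(escaped, raw_form)
# [1] TRUE

# So either one can be passed as the pattern
gsub(raw_form, "", "Economy is bad: WO001999")
# [1] "Economy is bad: "
```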

Hi @r2evans, your answer looks good. However, can you modify the code to handle cases where the symbols are not hard-coded and the dates are not in the right format? I have modified the original question to reflect this. Thanks. — William, Jun 11 '21 at 05:44
`gsub("^[ :]*|:[^:]*$", "", gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{2}(\\d{2})?", "", foo_data))` — r2evans, Jun 11 '21 at 14:39
Thanks @r2evans. It looks okay on the dummy dataset, but in the large dataset not all abbreviations were removed, though the date-like strings were. Abbreviations such as `"11D2013A"`, `"MLY3595"`, `"WAI004882"`, and `"4Fun"` were not removed from `c("Education, support - 11D2013A", "- MLY3595 - Breast Feeding", "WAI004882", "Chevy - - 4Fun Literacy and Numeracy")`. The expected result should be `"Education, support", "Breast Feeding", NA, "Literacy and Numeracy"`. Any help will be appreciated. — William, Jun 14 '21 at 05:18
I can only work on what I "know". Perhaps you can add a few more examples that clearly communicate the difference and evidence the over-selection in the regex. — r2evans, Jun 14 '21 at 12:15

Answer 2 · answered Apr 22 '21 at 07:37 · Peter

Using stringr, try this:

library(stringr)
library(magrittr)

str_remove_all(foo_data, "\\/|\\d+|\\: WO") %>% 
  str_squish()

#> [1] "Education is good"         "Health is a priority"     
#> [3] "Economy is bad"            "Vehicle license is needed"

Created on 2021-04-22 by the reprex package (v2.0.0)

data

foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")

score 0 · Answer 3 · answered Apr 22 '21 at 07:41

foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good"         "Health is a priority"     
#> [3] "Economy is bad"            "Vehicle license is needed"

Created on 2021-04-22 by the reprex package (v2.0.0)
