I think specificity is in order here:
First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two fields can be 1-2 digits and the third is always 4 digits. If the format varies, the regex can be changed to be a little more permissive:
foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982" " Health is a priority: WO0002021" " Economy is bad: WO001999" " Vehicle license is needed: WO001050"
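As an aside on the "more permissive" point, here is a rough sketch of what a looser date pattern might look like. It assumes the separator could be /, - or . and that the year could be 2 to 4 digits; both assumptions and the sample strings in loose_input are hypothetical, not taken from the question. A pattern this loose will also match things that merely look like dates (e.g., a version number such as 1.2.34), so tighten it to whatever your data really contains.

```r
# Hypothetical inputs with non-slash separators and a 2-digit year
loose_input <- c("1-2-21 Economy is bad", "01.02.2021 Health is a priority")

# [/.-] accepts any of the three separators; \\d{2,4} accepts 2- to 4-digit years
loose_date <- "\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{2,4}"

loose_output <- gsub(loose_date, "", loose_input)
loose_output
# [1] " Economy is bad"       " Health is a priority"
```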
From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified whether the abbreviation is hard-coded to be anything after a colon, numbers prepended with "WO", or just some one-word combination of letters and numbers. Those could be:
gsub(":.*", "", foo_data2)
# [1] " Education is good" " Health is a priority" " Economy is bad" " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
The : removal should be straightforward, and using trimws(.) will remove the leading/trailing spaces.
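A minimal sketch of those last two steps, starting from the foo_data2 values shown above and using the letters-then-digits pattern (the intermediate name no_abbrev is just for illustration):

```r
foo_data2 <- c(" Education is good: WO0001982", " Health is a priority: WO0002021",
               " Economy is bad: WO001999", " Vehicle license is needed: WO001050")

# Drop the abbreviation, leaving a trailing ": "
no_abbrev <- gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)

# Remove the literal colon (fixed = TRUE skips regex parsing), then trim whitespace
cleaned <- trimws(gsub(":", "", no_abbrev, fixed = TRUE))
cleaned
# [1] "Education is good"         "Health is a priority"      "Economy is bad"            "Vehicle license is needed"
```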
This can obviously be combined into a single regex (using the logical | with pattern grouping) or a single R call (nested gsub) without complication; I kept them broken apart for discussion.
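For illustration, both combined forms might look roughly like the following. The original foo_data is not reproduced in this answer, so foo_raw below is a hypothetical stand-in that assumes each string starts with the date, and the sketch assumes the abbreviation is "anything after the colon".

```r
# Hypothetical inputs: date, then text, then ": <abbreviation>"
foo_raw <- c("12/01/2020 Education is good: WO0001982",
             "3/4/2021 Economy is bad: WO001999")

date_re <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}"

# Option 1: one regex, alternating between the date pattern and ":.*"
single_regex <- trimws(gsub(paste0(date_re, "|:.*"), "", foo_raw))

# Option 2: one expression with nested gsub calls, one pattern per call
nested_call <- trimws(gsub(":.*", "", gsub(date_re, "", foo_raw)))

single_regex
# [1] "Education is good" "Economy is bad"
identical(single_regex, nested_call)
# [1] TRUE
```

The nested form is a little longer but keeps each pattern readable on its own, which can make later changes easier.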
I think https://stackoverflow.com/a/22944075/3358272 is a good reference for regex in general. Note that while that page shows many regex things with single backslashes, R requires all of those to use double backslashes (e.g., \d in regex needs to be \\d in R). The exception to this is if you use the raw strings introduced in R 4.0.0, where these two are identical:
"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
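A quick way to confirm this yourself, assuming R 4.0.0 or later (the variable names are only for illustration):

```r
escaped  <- "\\b[A-Za-z]+\\d+\\b"
raw_form <- r"(\b[A-Za-z]+\d+\b)"

# Both literals produce the same character string
identical(escaped, raw_form)
# [1] TRUE

# cat() prints the string as the regex engine sees it, with single backslashes
cat(escaped, "\n")
```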