R - How to remove dates from a string?

Question

I am cleaning up a table that contains all sorts of weird naming conventions. One of the names I keep seeing is a string of dates. Some of these names contain numbers which are okay but I would like to remove date formats from the strings.

Edit - Dates are either in mm/YY or mm/YYYY format. The dates are normally from 2017 onwards as I have seen (we want more recent updates).

For example:

names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')

Should be:

c('IT Company', 'Tech Company', 'XYZ Company', '50/50 Phone Company')

I tried to use this function here to flag strings with "/" and dates but it also extracts numbers that are not dates:

names2 <- names[grepl("[[:digit:]]", names) & grepl("/", names)]

Output

> names2
[1] "IT Company 09/18"   
[2] "Tech Company 9/17"  
[3] "9/2018 XYZ Company" 
[4] "50/50 Phone Company"

Is there a specific date expression I can use in place of [[:digit:]] to find strings with dates?

Also, what is the function to remove dates including the slash from a string?

What do you want to be considered dates / not-dates? Is 10/10 a date? In this example you could just include `& !grepl("50/50", names)` — IceCreamToucan, Mar 14 '18 at 20:01
The dates from what I have seen are in mm/YY or mm/YYYY format. 10/10 would not be a date as it is dd/mm format. There are about 100K records and I have been putting names like "50/50 Phone Company" into an exception list. — cs_is_great, Mar 14 '18 at 20:08
You are right. For this exercise, we want recent dates from 2017 onwards. The data is very recent. I will edit my question. — cs_is_great, Mar 14 '18 at 20:15

HarlandMason · Accepted Answer · 2018-03-14T20:12:03.483

Sounds like a job for the stringr packages' string_remove_all function.

The trick is getting the regex pattern right. The fact that you don't have a standard date format you're looking out for and that you'd like to retain 50/50 make life tough.

This worked for me:

library('stringr')

date.pattern <- ' ?(0|1)?[1-9]/([0-9]{4}|[0-9]{2}) ?'

names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')
str_remove_all(names, date.pattern)

That regex is supposed to say

"A date is one number, or two numbers where the first number is 0 or 1. (Assumes the first number is always a month, which may have a leading zero for single digit months).
Then it (lazily) allows for years of length 2 or 4. This could be made more precise if you have a sense of which years you expect to encounter. "0001" probably isn't a year in your dataset.
Finally, it will also remove leading or trailing spaces if they exist, since it looks like dates can come before and after the part you care about and are space-delimited from the part you care about.

Output

[1] "IT Company"          "Tech Company"        "XYZ Company"        
[4] "50/50 Phone Company"

It might be smart to also run this through str_extract_all and inspect the outputs to see if they all have the appearance of dates.

Thanks for the detailed explanation. I tried to use the str_remove_all function but I get this error. Error in str_remove_all(names, date.pattern) : could not find function "str_remove_all" — cs_is_great, Mar 14 '18 at 20:19
If you load the latest version of `stringr` (1.3.0), `str_rmove_all` will be available. — IceCreamToucan, Mar 14 '18 at 20:26

score 1 · Answer 2 · answered Mar 14 '18 at 20:01

1

I believe the following will do what you want. It uses a regex followed by trimws to trim the white spaces from the beginning and end of the result.

trimws(gsub("[[:digit:]]{1,4}/[[:digit:]]{1,4}", "", names))
#[1] "IT Company"    "Tech Company"  "XYZ Company"   "Phone Company"

answered Mar 14 '18 at 20:01

Rui Barradas

70,273
8
34
66

This can work if I want to remove all numbers but I would like to keep 50/50 Phone Company. – cs_is_great Mar 14 '18 at 20:25

score 1 · Answer 3 · answered Mar 14 '18 at 20:08

Here is a stringr solution. This is complicated because it is hard to tell whether something is a date just by looking at it: you need to check your outputs and adjust this code if there are more cases. The first regular expression catches formats that are 1/2 digits, /, a 0 or 1, and one more digit, assuming any dates are post year 2000. The second line gets rid of digit, / and then four digits. This nicely excludes 50/50, but what if it was called 12/50 Phone Company? That could be interpreted as Dec 1950 in a certain date format. You'll have to make sure you don't accidentally exclude any similar special cases, there isn't a universal way to tell whether something is meant to be a date or not.

I would read up on regular expressions; this is a good resource to start.

library(stringr)
names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')

names %>%
  str_replace("\\d{1,2}/(0|1)\\d", "") %>%
  str_replace("\\d/\\d{4}", "") %>%
  str_trim()
#> [1] "IT Company"          "Tech Company"        "XYZ Company"        
#> [4] "50/50 Phone Company"

Created on 2018-03-14 by the reprex package (v0.2.0).

R - How to remove dates from a string?

3 Answers3

Linked