1

I am cleaning up a table that contains all sorts of weird naming conventions. One of the names I keep seeing is a string of dates. Some of these names contain numbers which are okay but I would like to remove date formats from the strings.

Edit - Dates are either in mm/YY or mm/YYYY format. The dates are normally from 2017 onwards as I have seen (we want more recent updates).

For example:

names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')

Should be:

c('IT Company', 'Tech Company', 'XYZ Company', '50/50 Phone Company') 

I tried to use this function here to flag strings with "/" and dates but it also extracts numbers that are not dates:

names2 <- names[grepl("[[:digit:]]", names) & grepl("/", names)]

Output

> names2
[1] "IT Company 09/18"   
[2] "Tech Company 9/17"  
[3] "9/2018 XYZ Company" 
[4] "50/50 Phone Company"

Is there a specific date expression I can use in place of [[:digit:]] to find strings with dates?

Also, what is the function to remove dates including the slash from a string?

cs_is_great
  • 33
  • 1
  • 9
  • What do you want to be considered dates / not-dates? Is 10/10 a date? In this example you could just include `& !grepl("50/50", names)` – IceCreamToucan Mar 14 '18 at 20:01
  • The dates from what I have seen are in mm/YY or mm/YYYY format. 10/10 would not be a date as it is dd/mm format. There are about 100K records and I have been putting names like "50/50 Phone Company" into an exception list. – cs_is_great Mar 14 '18 at 20:08
  • 10/10 could be mm/YY – IceCreamToucan Mar 14 '18 at 20:08
  • You are right. For this exercise, we want recent dates from 2017 onwards. The data is very recent. I will edit my question. – cs_is_great Mar 14 '18 at 20:15

3 Answers3

4

Sounds like a job for the stringr packages' string_remove_all function.

The trick is getting the regex pattern right. The fact that you don't have a standard date format you're looking out for and that you'd like to retain 50/50 make life tough.

This worked for me:

library('stringr')

date.pattern <- ' ?(0|1)?[1-9]/([0-9]{4}|[0-9]{2}) ?'

names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')
str_remove_all(names, date.pattern)

That regex is supposed to say

  • "A date is one number, or two numbers where the first number is 0 or 1. (Assumes the first number is always a month, which may have a leading zero for single digit months).

  • Then it (lazily) allows for years of length 2 or 4. This could be made more precise if you have a sense of which years you expect to encounter. "0001" probably isn't a year in your dataset.

  • Finally, it will also remove leading or trailing spaces if they exist, since it looks like dates can come before and after the part you care about and are space-delimited from the part you care about.

Output

[1] "IT Company"          "Tech Company"        "XYZ Company"        
[4] "50/50 Phone Company"

It might be smart to also run this through str_extract_all and inspect the outputs to see if they all have the appearance of dates.

HarlandMason
  • 779
  • 5
  • 17
1

I believe the following will do what you want. It uses a regex followed by trimws to trim the white spaces from the beginning and end of the result.

trimws(gsub("[[:digit:]]{1,4}/[[:digit:]]{1,4}", "", names))
#[1] "IT Company"    "Tech Company"  "XYZ Company"   "Phone Company"
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
1

Here is a stringr solution. This is complicated because it is hard to tell whether something is a date just by looking at it: you need to check your outputs and adjust this code if there are more cases. The first regular expression catches formats that are 1/2 digits, /, a 0 or 1, and one more digit, assuming any dates are post year 2000. The second line gets rid of digit, / and then four digits. This nicely excludes 50/50, but what if it was called 12/50 Phone Company? That could be interpreted as Dec 1950 in a certain date format. You'll have to make sure you don't accidentally exclude any similar special cases, there isn't a universal way to tell whether something is meant to be a date or not.

I would read up on regular expressions; this is a good resource to start.

library(stringr)
names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')

names %>%
  str_replace("\\d{1,2}/(0|1)\\d", "") %>%
  str_replace("\\d/\\d{4}", "") %>%
  str_trim()
#> [1] "IT Company"          "Tech Company"        "XYZ Company"        
#> [4] "50/50 Phone Company"

Created on 2018-03-14 by the reprex package (v0.2.0).

Calum You
  • 14,687
  • 4
  • 23
  • 42