I want to remove numbers (integers and floats) from a character vector, preserving dates:

"I'd like to delete numbers like 84 and 0.5 but not dates like 2015"

I would like to get:

"I'd like to delete numbers like and but not dates like 2015"

In English a quick and dirty rule could be: if the number starts with 18, 19, or 20 and has length 4, don't delete.

I asked the same question in Python and the answer was very satisfying (\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?).

However, when I pass the same regex to gsub in R:

gsub("[\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?]"," ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015")

I get:

Error: '\d' is an unrecognized escape in character string starting ""\b(?!(?:18|19|20)\d"

Antoine
  • Just FYI: In R `gsub`, you need to double backslashes. And you should not put all into a character class `[...]`. Also, the lookahead requires the use of `perl=T`. – Wiktor Stribiżew Jan 05 '16 at 11:23
  • Replace \ with \\. @stribizhev not only for gsub. And also don't put your regex inside `[]`. – Avinash Raj Jan 05 '16 at 11:23
  • Use [`gsub("\\b(?!(?:18|19|20)\\d{2}\\b(?!\\.\\d))\\d*\\.?\\d+\\b"," ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015", perl=T)`](https://ideone.com/nEh4Ea). – Wiktor Stribiżew Jan 05 '16 at 11:29
  • @stribizhev this is perfect, thanks. If you make it an answer I'll accept it and close the thread. – Antoine Jan 05 '16 at 11:32

2 Answers

As I mentioned in my comments, the main points here are:

  • the regex pattern should not be wrapped in a character class `[...]`: inside a class, each character is treated as a separate symbol, not as part of a sequence of subpatterns
  • the backslashes must be doubled in R regex patterns, since \ is the escape character in R string literals (as in \n, \r, etc.), so the regex \d is written "\\d"
  • you need `perl=TRUE` (`perl=T` for short) with patterns featuring lookarounds (you are using lookaheads in yours), because R's default regex engine does not support them
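
To make the first two points concrete, here is a minimal sketch on toy inputs (the strings `"a12b"` and `"a1+2b"` are made up for illustration, and the raw-string line assumes R 4.0.0 or later):

```r
# "\\d+" in R source is the three characters \, d, + once parsed,
# which is exactly what the regex engine needs to receive.
pattern <- "\\d+"
pattern_length <- nchar(pattern)
cat(pattern, "\n")  # shows the pattern as the regex engine sees it

# Assumption: R >= 4.0.0, where raw strings avoid the doubling.
same_as_raw <- identical(pattern, r"(\d+)")

# As a sequence of subpatterns, \d+ removes whole runs of digits.
as_sequence <- gsub("\\d+", "", "a12b", perl = TRUE)

# Inside [...], the same text is just a set of single characters
# (any digit, or a literal "+"), each matched on its own.
as_class <- gsub("[\\d+]", "", "a1+2b", perl = TRUE)
```

This is why a lookahead such as `(?!...)` placed inside `[...]` stops being a lookahead: its characters simply become members of the class.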

Use

gsub("\\b(?!(?:18|19|20)\\d{2}\\b(?!\\.\\d))\\d*\\.?\\d+\\b", " ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015", perl=TRUE)

See IDEONE demo.
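
Note that replacing each number with a space leaves runs of spaces behind. If you want exactly the output shown in the question, one option is to collapse the whitespace afterwards. A rough sketch, where `remove_numbers` is a hypothetical helper name of my own and not part of base R:

```r
# Hypothetical helper: drop integers and floats, keep four-digit
# years starting with 18, 19 or 20, then tidy up the spacing.
remove_numbers <- function(x) {
  pattern <- "\\b(?!(?:18|19|20)\\d{2}\\b(?!\\.\\d))\\d*\\.?\\d+\\b"
  out <- gsub(pattern, " ", x, perl = TRUE)   # numbers -> single space
  out <- gsub("\\s+", " ", out, perl = TRUE)  # collapse runs of whitespace
  trimws(out)                                 # drop leading/trailing space
}

result <- remove_numbers(
  "I'd like to delete numbers like 84 and 0.5 but not dates like 2015"
)
print(result)
# [1] "I'd like to delete numbers like and but not dates like 2015"
```

Since `gsub` is vectorized, the same helper should also work on a whole character vector.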

Wiktor Stribiżew

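To search and replace in R you can use the following, where `subject` is your input vector and `\\p{Nd}` matches any Unicode decimal digit. The pattern ends in `\\p{Nd}+\\b` so that it only matches where at least one digit is present:

gsub("\\b(?!(?:18|19|20)\\p{Nd}{2}\\b(?!\\.\\p{Nd}))\\p{Nd}*\\.?\\p{Nd}+\\b", "replacement_text_here", subject, perl=TRUE)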
Andie2302