19

I am new to R so I hope you can help me.

I want to use gsub to remove all punctuation except for periods and minus signs so I can keep decimal points and negative symbols in my data.

Example

My data frame z has the following data:

     [,1] [,2]   
[1,] "1"  "6"    
[2,] "2@" "7.235"
[3,] "3"  "8"    
[4,] "4"  "$9"   
[5,] "£5" "-10" 

I want to use gsub("[[:punct:]]", "", z) to remove the punctuation.

Current output

> gsub("[[:punct:]]", "", z)
     [,1] [,2]  
[1,] "1"  "6"   
[2,] "2"  "7235"
[3,] "3"  "8"   
[4,] "4"  "9"   
[5,] "5"  "10" 

I would like, however, to keep the "-" sign and the "." sign.

Desired output

 PSEUDO CODE:  
> gsub("[[:punct:]]", "", z, except(".", "-") )
         [,1] [,2]  
    [1,] "1"  "6"   
    [2,] "2"  "7.235"
    [3,] "3"  "8"   
    [4,] "4"  "9"   
    [5,] "5"  "-10" 

Any ideas how I can make some characters exempt from the gsub() function?

Wiktor Stribiżew

4 Answers

22

You can put back some matches like this:

gsub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
     X..1. X..2.  
[1,] "1"   "6"    
[2,] "2"   "7.235"
[3,] "3"   "8"    
[4,] "4"   "9"    
[5,] "5"   "-10"  

Here I am keeping the . and -.

I guess the next step is to coerce your result to a numeric matrix, so here I combine the two steps:

matrix(as.numeric(gsub("([.-])|[[:punct:]]", "\\1", as.matrix(z))), ncol = 2)
     [,1]    [,2]
[1,]    1   6.000
[2,]    2   7.235
[3,]    3   8.000
[4,]    4   9.000
[5,]    5 -10.000
agstudy
  • Thanks for this, works perfectly. I only needed the first part. Could you explain what's happening here? I understand that you are separating the . and - from [:punct:], but I'm not sure how. – Crayon Constantinople Feb 03 '14 at 19:53
  • \\1 is a backreference to the first capture group, i.e. the part of the pattern inside (). It stands for whatever that group matched. I put only "." and "-" in the group, so when one of them is matched, \\1 replaces it with itself and it is kept. Any other punctuation matches the second alternative, where the group is empty, so it is replaced with nothing. – agstudy Feb 03 '14 at 20:02
  • @CrayonConstantinople I am not sure that my "english" explanation is good, maybe better to read about group capture [here](http://www.regular-expressions.info/named.html). – agstudy Feb 03 '14 at 20:03
  • Thank you for your help. Out of interest, is there much change needed for this to change it to a data.frame rather than a matrix? – Crayon Constantinople Feb 04 '14 at 13:06
  • @CrayonConstantinople just apply `as.data.frame` in this. – agstudy Feb 04 '14 at 13:12
  • Great example, but I was wondering if the order of the symbols is important, do they have to be ascending or descending? I noticed that `gsub("([_.-])|([[:punct:]])", "\\1", "name?%_ -4")` and `gsub("([.-_])|([[:punct:]])", "\\1", "name?%_ -4")` have different results; `"name_ -4"` and `"name?_ 4"`, and the only difference is the position of the underscore. – geneorama Jun 10 '19 at 22:45
  • I think the `-` is interpreted literally because it's at the end of the bracket expression. If it were between two characters, it would mean "range". – geneorama Jun 10 '19 at 22:50
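The effect of the hyphen's position described in the last two comments can be reproduced directly. This is a small sketch using the same input string as the comment; the variable names are chosen here for illustration only.

```r
x <- "name?%_ -4"

# Hyphen last in the bracket expression: it is a literal "-",
# so "_", "." and "-" are captured and put back by \\1.
hyphen_last <- gsub("([_.-])|[[:punct:]]", "\\1", x)
print(hyphen_last)
# [1] "name_ -4"

# Hyphen between "." and "_": it defines a range from "." to "_",
# which covers "?" and the digits but not "-" itself.
hyphen_range <- gsub("([.-_])|[[:punct:]]", "\\1", x)
print(hyphen_range)
# [1] "name?_ 4"
```

Placing the hyphen first or last in a bracket expression avoids the accidental range.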
8

You may try this code. I found it quite handy.

x <- c('6,345', '7.235', '8', '$9', '-10')
gsub("[^[:alnum:][:space:].-]", "", x)

[1] "6345"  "7.235" "8"     "9"     "-10"

x <- c('1', '2@', '3', '4', '£5')
gsub("[^[:alnum:][:space:].-]", "", x)

[1] "1" "2" "3" "4" "5"

gsub("[^[:alnum:]]", "", x) removes every character that is not alphanumeric. We then add the characters we want to keep to the negated class: whitespace ([:space:]), full stop (.) and hyphen (-). The hyphen is placed last so that it is read literally rather than as a range, and the POSIX classes work both in the default engine and with perl = TRUE. The resulting gsub("[^[:alnum:][:space:].-]", "", x) removes everything that is not alphanumeric, whitespace, a full stop or a hyphen.
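As a quick check of the "keep list" idea, here is a minimal sketch on made-up inputs that contain spaces and unit labels. It assumes the exceptions are written with POSIX classes ([:space:] for whitespace) and that the hyphen sits last in the bracket expression. The vector and variable names are illustrative.

```r
# Made-up inputs containing spaces, units and currency symbols.
x <- c("6,345 kg", "- 7.2 %", "$9")

# Keep letters, digits, whitespace, dots and hyphens; drop everything else.
kept <- gsub("[^[:alnum:][:space:].-]", "", x)
print(kept)
# elements: "6345 kg", "- 7.2 ", "9"
```

Because letters are part of [:alnum:], unit labels such as "kg" survive. If only numbers should remain, a narrower class such as [^0-9.-] may be a better fit.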

user6793824
6

Here are some options to restrict a generic character class in R using both base R (g)sub and the stringr remove/replace functions:

(g)sub with perl=TRUE

You may use the [[:punct:]] bracket expression with the [:punct:] POSIX character class and restrict it with the (?!\.) negative lookahead that will require that the immediately following char on the right is not equal to .:

(?!\.)[[:punct:]]      # Excluding a dot only
(?![.-])[[:punct:]]    # Excluding a dot and hyphen

To match one or more occurrences, wrap it with a non-capturing group and then set the + quantifier to the group:

(?:(?!\.)[[:punct:]])+   # Excluding a dot only
(?:(?![.-])[[:punct:]])+ # Excluding a dot and hyphen

Note that when you remove the matches, both expressions yield the same result. When you replace them with some other string or character, however, the quantified version replaces each run of consecutive matching characters with a single occurrence of the replacement.

With stringr replace/remove functions

Before going into details, mind that the PCRE [[:punct:]] used with (g)sub will not match the same chars in the stringr regex functions that are powered by the ICU regex library. You need to use [\p{P}\p{S}] instead, see R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

The ICU regex library has a nice feature that can be used with character classes, called character class subtraction.

So, you write your character class, say, all punctuation matching class like [\p{P}\p{S}], and then you want to "exclude" (=subtract) a char or two or three, or a whole subclass of chars. You may use two notations:

[\p{P}\p{S}&&[^.]]   # Excluding a dot
[\p{P}\p{S}--[.]]    # Excluding a dot
[\p{P}\p{S}&&[^.-]]  # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]   # Excluding a dot and hyphen

To match 1+ consecutive occurrences with this approach, you do not need any wrapping groups, simply use +:

[\p{P}\p{S}&&[^.]]+  # Excluding a dot
[\p{P}\p{S}--[.]]+   # Excluding a dot
[\p{P}\p{S}&&[^.-]]+  # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]+   # Excluding a dot and hyphen

See R demo tests with outputs:

x <- "Abc.123#&*xxx(x-y-z)???? some@other!chars."

gsub("(?!\\.)[[:punct:]]", "", x, perl=TRUE)
## => [1] "Abc.123xxxxyz someotherchars."
gsub("(?!\\.)[[:punct:]]", "~", x, perl=TRUE)
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
gsub("(?:(?!\\.)[[:punct:]])+", "~", x, perl=TRUE)
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."

library(stringr)
stringr::str_remove_all(x, "[\\p{P}\\p{S}&&[^.]]") # Same as "[\\p{P}\\p{S}--[.]]"
## => [1] "Abc.123xxxxyz someotherchars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]", "~")
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]+", "~")  # Same as "[\\p{P}\\p{S}--[.]]+"
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
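The demo above excludes only the dot. As a rough sketch of the dot-and-hyphen variant, the lookahead pattern can be applied to data shaped like the question's. The matrix below is a made-up, ASCII-only stand-in (with "#" in place of the pound sign), and the names cleaned and num are illustrative.

```r
# Made-up, ASCII-only stand-in for the question's data.
z <- matrix(c("1", "2@", "3", "4", "#5",
              "6", "7.235", "8", "$9", "-10"), ncol = 2)

# Remove any punctuation character unless it is a dot or a hyphen.
cleaned <- gsub("(?![.-])[[:punct:]]", "", z, perl = TRUE)
print(cleaned)

# Optional: convert the cleaned strings to a numeric matrix.
num <- matrix(as.numeric(cleaned), ncol = ncol(cleaned))
print(num)
```

gsub keeps the attributes of its input, so cleaned is still a 5 x 2 character matrix.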
Wiktor Stribiżew
2

Another way to think about it is what do you want to keep? You can use regular expressions to both keep information as well as omit it. I have a lot of data frames that I need to clean units out of and convert from multiple rows in one pass and I find it easiest to use something from the apply family in these instances.

Recreating the example:

a <- c('1', '2@', '3', '4', '£5')
b <- c('6', '7.235', '8', '$9', '-10')
z <- matrix(data = c(a, b), nrow = length(a), ncol=2)

Then use apply in conjunction with gsub.

apply(z, 2, function(x) as.numeric(gsub('[^0-9\\.\\-]', '', x)))
      [,1]    [,2]
[1,]    1   6.000
[2,]    2   7.235
[3,]    3   8.000
[4,]    4   9.000
[5,]    5 -10.000

This instructs R to match every character that is not a digit, a period or a hyphen, and gsub then replaces each match with an empty string. Personally, I find it much cleaner and easier to use in these situations, and it gives the same output.
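For a data frame rather than a matrix, one possible variant is to loop over the columns with lapply. This is a sketch under assumptions: the data frame df and its column names are invented for illustration, and "#" stands in for the pound sign to keep the example ASCII-only.

```r
# Hypothetical data frame with the same kind of values as the question.
df <- data.frame(a = c("1", "2@", "3", "4", "#5"),
                 b = c("6", "7.235", "8", "$9", "-10"),
                 stringsAsFactors = FALSE)

# Clean every column, then convert it to numeric.
# Assigning into df[] keeps the data frame structure and column names.
df[] <- lapply(df, function(col) as.numeric(gsub("[^0-9.-]", "", col)))
str(df)
```

Here the hyphen is placed last in the bracket expression, so no backslash escapes are needed.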

Also, the documentation has a good explanation of these powerful but confusing regular expressions.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html

Or ?regex

hubbs5