
I have a matrix that contains the string "Energy per �m". Before the 'm' is a diamond-shaped symbol with a question mark in it, and I don't know what it is.

I have tried to get rid of it by using this on the column of the matrix:

a=gsub('Energy per �m','',a) 

(copying and pasting the string into the first argument of gsub), but it does not work; I get the error: unexpected symbol in "a=rep(5,Energy per". When I try to extract something from the original matrix with grepl, I get this warning:

46: In grepl("ref. value", raw$parameter) :
input string 15318 is invalid in this locale

How can I get rid of all characters of this sort? I would like to keep only 0-9, A-Z, a-z, / and '. The rest can be zapped.

oguz ismail
Henk

2 Answers


There is probably a better way to do this than with regex (e.g. by changing the Encoding).

But here is your regex solution:

gsub("[^0-9A-Za-z/' ]", "", a)
[1] "Energy per m"

But, as pointed out by @JoshuaUlrich, you're better off using:

gsub("[^[:alnum:]/' ]", "", a)
[1] "Energy per m"
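
As a minimal sketch of extending the whitelist (the comments on this answer mention also needing characters such as . , ( ) - ? and space), the extra characters can be added inside the bracket expression, with the hyphen placed last so that it is read literally rather than as a range. The sample string and the name `keep_extra` are made up for illustration, and this assumes the input is valid in the current locale.

```r
# Hypothetical sample; "\ufffd" is the replacement character
# (the diamond with a question mark from the question).
x <- "Energy per \ufffdm (ref. value)"

# Keep letters, digits, / ' space . , ( ) ? and -; drop everything else.
# The hyphen comes last in the bracket expression so it is not parsed as a range.
keep_extra <- gsub("[^[:alnum:]/' .,()?-]", "", x)
keep_extra
```

If a string is not valid in the current locale, gsub may refuse it with the "invalid in this locale" warning from the question, in which case the encoding has to be fixed first.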
Andrie
  • `[^[:alnum:]]` is preferred to `[^0-9A-Za-z]`. Regarding the latter, `?regex` says "because their interpretation is locale- and implementation-dependent, they are best avoided." and "For example, `[[:alnum:]]` means `[0-9A-Za-z]`, except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set." – Joshua Ulrich Aug 15 '12 at 14:37
  • Thanks. I only know that from being Ripley'd for using your first solution in a package. ;-) – Joshua Ulrich Aug 15 '12 at 15:02
  • Thanks! Andrie, it turned out that I needed more characters, like .,()-? and space. Extending the regex or replacing it with `[[:alnum:]]` or `[[:print:]]` gave strange results. The easiest solution was indeed to fix the encoding via `iconv`. This worked only after converting the column from factor to character. Phew... – Henk Aug 15 '12 at 15:24
  • @user580110 Do you care to post your `iconv` solution? I tried a few permutations, but couldn't get it to work. – Andrie Aug 15 '12 at 15:28
  • Ah, and the culprit was the "euro" sign. Setting the wrong encoding zaps the parameters. Latin1 just leaves an empty space. – Henk Aug 15 '12 at 15:33
  • To remove the non-printable characters: `raw$parameter=as.character(raw$parameter); raw$parameter=iconv(raw$parameter,'Latin-9')` – Henk Nov 06 '12 at 15:38
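
The encoding route discussed in these comments can be sketched roughly as follows. This is an illustration under assumptions, not Henk's exact code: the input byte 0xA4 (the euro sign in Latin-9) is assumed from the "euro sign" comment, the variable names are hypothetical, and "latin1" is used as the source encoding only because it is widely available; where the platform's iconv supports "ISO-8859-15" (Latin-9), that name could be passed as `from` instead.

```r
# Assumed input: "Energy per <0xA4>m", built from raw bytes so that the
# example does not depend on how this file itself is encoded.
bytes <- rawToChar(as.raw(c(utf8ToInt("Energy per "), 0xA4, utf8ToInt("m"))))

# A lone 0xA4 byte is not valid UTF-8, which is the kind of input that
# triggers "input string ... is invalid in this locale" in a UTF-8 session.
validUTF8(bytes)

# If the column is a factor, convert it with as.character() first,
# as noted in the comments; here the input is already character.
utf8 <- iconv(as.character(bytes), from = "latin1", to = "UTF-8")

# Once the string is valid, the whitelist regex from the answer applies.
cleaned <- gsub("[^[:alnum:]/' ]", "", utf8)
cleaned
```

Declaring the source encoding explicitly makes the string valid before any regex runs, so the symbol can then be kept, replaced, or stripped as needed.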

str_replace_all() is an option if you prefer to use the stringr package:

library(stringr)

x <- 'Energy per �m'

str_replace_all(x, "[^[:alnum:]/' ]", "")
[1] "Energy per m"
Harrison Jones