I have a few strings in a data set that contain the characters

\x96
\x92

and others.

I can't figure out how to grep for them in R.
I have tried using

pattern="\x96"
pattern="\\x96"
pattern="x96"

but to no avail.

Is there a specific way of dealing with such characters in R?


**UPDATE**: As per the suggestion in the comments, `perl=TRUE` allows the grep to work.

Can anyone offer a solid explanation of what is going on?

Session info, in case it is relevant:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_0.9.3    RMySQL_0.9-3     DBI_0.2-5        stringr_0.6.1    data.table_1.8.6
Ricardo Saporta
  • Can you edit your title, please? It appears to have no connection to your actual question. – joran Feb 14 '13 at 16:44
  • Please give an example that reproduces your problem. `pattern <- "\x92"; grepl(pattern, "\x92")` works, so it's hard to guess what's wrong in your case... – Josh O'Brien Feb 14 '13 at 17:37
  • I'd guess these are characters that are not reproducible in standard ASCII, so it's showing us the hex representation or something like that. But I don't know how to recreate them or to grep for them (and clearly neither does the OP). @JoshO'Brien, any suggestions on how to reproduce if this is really what's happening? – Aaron left Stack Overflow Feb 14 '13 at 17:43
  • @JoshO'Brien, unfortunately, I cannot offer a **reproducible** example. If I `dput` the values, they are printed as ASCII characters, and when I save them as a new object they are not the same characters as the offending ones, although they look similar. – Ricardo Saporta Feb 14 '13 at 17:46
  • It's a ruby question, but perhaps this might lead to some better information: http://stackoverflow.com/questions/5053216/when-we-import-csv-data-how-eliminate-invalid-byte-sequence-in-utf-8 – Aaron left Stack Overflow Feb 14 '13 at 17:48
  • @JoshO'Brien FYI when I run your example `grepl` throws an error: "regular expression is invalid in this locale". – joran Feb 14 '13 at 17:49
  • Also maybe http://cran.r-project.org/doc/manuals/R-data.html#Encodings – Aaron left Stack Overflow Feb 14 '13 at 17:50
  • @joran -- Weird. Does setting `perl=TRUE` make any difference? (FWIW, my locale, from `Sys.getlocale()` is `English_United States.1252`, and I'm working on a Windows box.) – Josh O'Brien Feb 14 '13 at 17:50
  • @JoshO'Brien It does, that seems to work. – joran Feb 14 '13 at 17:52
  • Thanks @joran. So, Ricardo, does setting `perl=TRUE` happen to solve your issue as well? – Josh O'Brien Feb 14 '13 at 17:54
  • `perl=TRUE` did the trick! Any thoughts as to what's going on? – Ricardo Saporta Feb 14 '13 at 18:59
  • @RicardoSaporta I have the same problem often from French characters. I've solved it before by setting the character encoding when I initially import my data. Windows-1252 is the one that comes to mind. Sometimes I even convert character encoding in a text editor before importing my data. Grep has problems finding them because they aren't literally `\x96`. `\x96` is just a representation of the actual character. – Brandon Bertelsen Feb 14 '13 at 21:04
  • @BrandonBertelsen Thank you for the suggestion. I did in fact convert the data set from `Latin1`. The `\x96` (and similar) characters are the ones that remained after converting. (Note that I also tried converting from other encodings as well, but that produced other issues, and the data source is pretty certain original is `Latin1`) – Ricardo Saporta Feb 15 '13 at 21:28
  • Yes, I had a very similar problem. I identified the original character set and then tried importing into R specifying the character set. Not sure you can do that now if you've saved over top of your previous character set. – Brandon Bertelsen Feb 15 '13 at 22:01

1 Answer

R supports several different types of regular expressions. The default is POSIX ERE (extended regular expressions), the flavor used by `grep -E` and other standard POSIX tools. But the POSIX ERE engine in R does not currently support escaping hex character codes:

Escaping non-metacharacters with a backslash is implementation-dependent. The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as LF, \r as CR and \t as TAB. (Note that these will be interpreted by R's parser in literal character strings.)

See Regular Expressions as used in R.

Setting `perl=TRUE` changes the engine R uses to process regular expressions to PCRE (Perl-compatible regular expressions). PCRE supports escaped hex character codes -- and voila, your regex now works.
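As a minimal sketch of the above (the vector `x` is made-up test data, not the asker's data set): `fixed` and `useBytes` are standard arguments of `grepl`, and combining them asks R to compare raw bytes, which avoids both the regex engine's escape handling and any locale validity checks. The last call uses a regex-level hex escape with PCRE. I am assuming here that `useBytes=TRUE` is needed alongside `perl=TRUE` on newer R versions in a UTF-8 locale, so treat it as something to verify on your own setup.

```r
# Hypothetical test data: R's parser turns "\x96" in a string literal
# into the single byte 0x96 before any regex engine sees it.
x <- c("plain ascii", "dash \x96 here", "quote \x92 here")

# Literal, byte-wise matching: no regex escapes, no locale checks.
has_96 <- grepl("\x96", x, fixed = TRUE, useBytes = TRUE)
has_92 <- grepl("\x92", x, fixed = TRUE, useBytes = TRUE)

# PCRE with regex-level hex escapes: the doubled backslash means the
# engine itself receives \x92 and \x96 and interprets them as bytes.
has_either <- grepl("[\\x92\\x96]", x, perl = TRUE, useBytes = TRUE)

print(data.frame(has_96, has_92, has_either))
```

Note the difference between `"\x96"` (one byte, produced by R's parser) and `"\\x96"` (four characters that are left for the regex engine to interpret).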

dpkp