The question is pretty simple. I'm trying to replace "\U" throughout a vector of strings, and for this I'm using the package {stringr}, but I'm having issues matching the pattern.

text <- "\U0001f517"

stringr::str_detect(text, "\U")
#> Error: '\U' used without hex digits in character string starting ""\U"

stringr::str_detect(text, "\\U")
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
#>   Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\U`)

stringr::str_detect(text, "\\\U")
#> Error: '\U' used without hex digits in character string starting ""\\\U"

stringr::str_detect(text, "\\\\U")
#> FALSE

stringr::str_detect(text, "\\\\\U")
#> Error: '\U' used without hex digits in character string starting ""\\\\\U"

stringr::str_detect(text, "\\\\\\U")
#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
#>   Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\\\U`)

stringr::str_detect(text, "\\\\\\\U")
#> Error: '\U' used without hex digits in character string starting ""\\\\\\\U"

# ... you get the idea

As far as I can tell, this issue is because the regex engine sees "\U" as indicating the beginning of a new hex code, as indicated by the first error. Other characters work fine:

text <- "\a0001f517"

stringr::str_detect(text, "\a")
#> TRUE

I've seen other questions around this issue, e.g. here, but still can't get this to work. Can anyone give me a working regex for this?

wurli
  • `\U` in `text <- "\U0001f517"` is not a separate character sequence; it is part of the Unicode code point notation. `text` is in fact `🔗`. `"\a"` is a single character, `\u0007` (running `"\a" == '\x07'` returns `TRUE`). – Wiktor Stribiżew Sep 27 '21 at 08:46
  • Aha! Think I understand. So I would actually need the original string to be something like `"\\U0001f517"` then? – wurli Sep 27 '21 at 08:48
  • I am not sure if I fully understand your question but perhaps the following post can be of help: https://stackoverflow.com/questions/25424382/replace-single-backslash-in-r – user16949460 Sep 27 '21 at 08:48
  • @user16949460 No, it is not about replacing backslashes. – Wiktor Stribiżew Sep 27 '21 at 08:48
  • `library(utf8)` and then `utf8_encode(text)` – Wiktor Stribiżew Sep 27 '21 at 08:56
  • @Wiktor Stribiżew Thank you, that's perfect! If you'd like to write up that as a standalone answer I'll happily accept it :) – wurli Sep 27 '21 at 09:01

1 Answer

The \U in your text <- "\U0001f517" is not a separate character sequence; it is part of the Unicode code point notation. The text variable actually holds the single character 🔗, which you can check with cat(text).

By contrast, "\a" is a single character (the "Bell" character) that can also be written as "\u0007" or "\x07" (run "\a" == '\x07' and you will see that the output is TRUE). See more about string escape sequences syntax.
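A quick way to confirm both points is to inspect the strings with base R. This is only a sketch: the variable names are illustrative, and it assumes an R session where the "\U" escape is parsed as UTF-8 (the default for string literals).

```r
text <- "\U0001f517"

# The whole escape collapses to one character, so there is no "U" to match
nchar(text)
#> 1
sprintf("U+%X", utf8ToInt(text))
#> "U+1F517"
grepl("U", text, fixed = TRUE)
#> FALSE

# "\a" and "\x07" are two spellings of the same single character
identical("\a", "\x07")
#> TRUE
```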

In R, to turn the character back into its escaped notation as literal text, you can use utf8_encode() from the {utf8} package:

text <- "\U0001f517"
cat(text)
## => 🔗

library("utf8")
text <- utf8_encode(text)
cat(text)
## => \U0001f517
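
Once the string holds the escape as literal text, the backslash and the "U" are ordinary characters and can be matched. The sketch below is not part of the approach above: it swaps in stringi::stri_escape_unicode() (stringi is already a dependency of {stringr}) as an alternative way to produce the escaped text, and `escaped` is an illustrative name. Using fixed matching avoids having to double-escape the backslash for the regex engine.

```r
text <- "\U0001f517"

# Assumption: stri_escape_unicode() is used here in place of utf8_encode()
escaped <- stringi::stri_escape_unicode(text)

# The escaped string now contains a literal backslash followed by "U"
grepl("\\U", escaped, fixed = TRUE)
#> TRUE
stringr::str_detect(escaped, stringr::fixed("\\U"))
#> TRUE

# Replace the literal "\U" prefix, for example with "U+"
stringr::str_replace(escaped, stringr::fixed("\\U"), "U+")
```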
Wiktor Stribiżew