Converting a \u escaped Unicode string to ASCII

Question

After reading all about iconv and Encoding, I am still confused.

I am scraping the source of a web page and have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'.

More simply, if I set

x <- 'pretty\\u003D\\u003Ebig'

How do I perform a conversion on x to yield pretty=>big?

Any suggestions?

The code you are using might be needed to replicate this problem. — IRTFM, Jul 20 '13 at 17:45

score 12 · Accepted Answer · answered Jul 22 '13 at 12:35

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1

answered Jul 22 '13 at 12:35

hadley

`as.character(x2)` will work too and will be vectorized (i.e.: `as.character(parse(text=paste0("'", rep(x1,3), "'")))`). Also `shQuote(x1)` could be handy instead of `paste0`. – Marek Apr 23 '14 at 21:00
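
A minimal sketch of the vectorised variant mentioned in the comment above, assuming no input element contains a quote character or a line break. The helper name `unescape_via_parse` is made up for illustration; it parses each element as a string literal and never evaluates anything.

```r
# Hypothetical helper: unescape \uXXXX sequences by parsing, never evaluating.
# Assumes no element of x contains a quote character or a line break.
unescape_via_parse <- function(x) {
  quoted <- paste0("'", x, "'")
  exprs <- parse(text = quoted, keep.source = FALSE)
  # One parsed expression per input element, otherwise something is off.
  stopifnot(length(exprs) == length(x))
  # Each element should be a length-one character constant;
  # vapply() raises an error if any element is something else.
  vapply(exprs, function(e) e, character(1))
}

unescape_via_parse(c('pretty\\u003D\\u003Ebig', 'x\\u0041y'))
# [1] "pretty=>big" "xAy"
```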

score 8 · Answer 2 · answered Jan 29 '17 at 17:47

With the stringi package:

> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"

answered Jan 29 '17 at 17:47

Stéphane Laurent

seancarmody · Answer 3 · 2013-07-21T12:30:44.980

Although I have accepted Hong Ooi's answer, I can't help thinking parse and eval is a heavyweight solution. Also, as pointed out, it is not secure, although for my application I can be confident that I will not get dangerous quotes.

So, I have devised an alternative, somewhat brutal, approach:

udecode <- function(string){
  uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
  ufilter <- function(string) {
    if (substr(string, 1, 1)=="|") uconv(substr(string, 2, 5)) else string
  }
  string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl=TRUE)
  strings <- unlist(strsplit(string, ","))
  string <- paste(sapply(strings, ufilter), collapse='')
  return(string)
}

Any simplifications welcomed!
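
One possible simplification, sketched in base R under the assumption that every escape is exactly `\u` followed by four hex digits. The name `udecode_regmatches` is hypothetical. It replaces each match in place with `regmatches<-`, so commas or `|` characters in the input are no longer a problem. It makes no attempt to combine surrogate pairs.

```r
# Hypothetical alternative to udecode(): replace each \uXXXX match in place.
udecode_regmatches <- function(x) {
  # Regex: a literal backslash, "u", then exactly four hex digits.
  m <- gregexpr("\\\\u[[:xdigit:]]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(
      hits,
      # Characters 3 to 6 of each match are the hex digits.
      function(h) intToUtf8(strtoi(substr(h, 3, 6), 16L)),
      character(1),
      USE.NAMES = FALSE
    )
  })
  x
}

udecode_regmatches(c('pretty\\u003D\\u003Ebig', 'no escapes, here'))
# [1] "pretty=>big"      "no escapes, here"
```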

score 2 · Answer 4 · edited May 07 '14 at 17:27

A use for eval(parse)!

eval(parse(text=paste0("'", x, "'")))

This has its own problems of course, such as having to manually escape any quote marks within the string. But it should work for any valid Unicode sequences that may appear.
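
To illustrate the quoting problem, here is a small sketch with a made-up hostile input. Parsing alone runs nothing, but it shows that an embedded quote splits the text into several expressions, one of which is a call that `eval` would execute. The guard `is_plain_string` is a hypothetical helper, not part of any package.

```r
# Made-up input that closes the quote early and smuggles in a call.
hostile <- "a'; stop('this would run under eval'); 'b"
exprs <- parse(text = paste0("'", hostile, "'"), keep.source = FALSE)

length(exprs)                            # 3 expressions, not 1
vapply(exprs, is.character, logical(1))  # TRUE FALSE TRUE

# Hypothetical guard: accept only a single string constant.
is_plain_string <- function(exprs) {
  length(exprs) == 1L && is.character(exprs[[1]])
}
is_plain_string(exprs)                   # FALSE
```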

edited May 07 '14 at 17:27

smci

answered Jul 20 '13 at 21:12

Hong Ooi

Fortunately, given the source of the strings, quotes are already sanitised. – seancarmody Jul 20 '13 at 22:25
I have incorporated the fix into my [ngram package](https://github.com/seancarmody/ngramr). Thanks! – seancarmody Jul 20 '13 at 22:36
I've asked about sanitising here: http://stackoverflow.com/questions/17770093/sanitising-strings-in-r – Hong Ooi Jul 21 '13 at 07:44
@HongOoi see my answer - just don't eval the result. – hadley Jul 22 '13 at 12:36

score 1 · Answer 5 · answered Jul 20 '13 at 12:14

I sympathise; I have struggled with R and unicode text in the past and not always successfully. If your data is in x then first try a global replace, something like this:

x <- gsub("\u003D", "=>", x)

I sometimes use a construction like

lapply(x, utf8ToInt)

to see where the high code points are e.g. anything over 150. This helps me locate problems caused by non-breaking spaces, for example, which seem to pop up every now and again.
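
A small sketch of that inspection step. The helper name `non_ascii_positions` is made up, and the cutoff of 127 (the top of the ASCII range) is an assumption chosen here rather than the threshold mentioned above.

```r
# Hypothetical helper: positions of code points above the ASCII range.
non_ascii_positions <- function(s) {
  stopifnot(is.character(s), length(s) == 1L)
  which(utf8ToInt(s) > 127L)
}

non_ascii_positions("plain text")         # integer(0)
non_ascii_positions("non\u00A0breaking")  # 4 (the non-breaking space)
```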

answered Jul 20 '13 at 12:14

SlowLearner

I'd need to use `x <- gsub("\u003D\u003E", "=>", x)` but I'd rather not cover every possible case. Note that `lapply(x, utf8ToInt)` yields `112 114 101 116 116 121 92 117 48 48 51 68 92 117 48 48 51 69 98 105 103` so that they are all low code points! I just need to undo the \u escaping! – seancarmody Jul 20 '13 at 12:17
Ah, good point, I am usually tackling Asian scripts which do yield high points, this is clearly a bit different. – SlowLearner Jul 20 '13 at 12:21

score 1 · Answer 6 · answered Jul 20 '13 at 14:05

> iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
[1] "pretty=>big"

but you appear to have an extra escape

answered Jul 20 '13 at 14:05

user1609452

Exactly! I can't help the extra escape: that's what the [web-page](http://books.google.com/ngrams/graph?content=pretty%3D%3Ebig&year_start=1800&year_end=2000&corpus=15) returns (click on it, view source and scroll down to the data). – seancarmody Jul 20 '13 at 14:18
It's also instructive to compare `cat('pretty\u003D\u003Ebig')` and `cat('pretty\\u003D\\u003Ebig')`. There is a big difference between escapes in data you enter at the console and escapes in data you obtain through other means. – seancarmody Jul 20 '13 at 14:19
are you using windows? – user1609452 Jul 20 '13 at 14:38

IRTFM · Answer 7 · 2013-07-20T23:46:09.260

The trick here is that '\\u003D' is actually 6 characters while you want '\u003D' which is only one character. The further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:

gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
#[1] "pretty=>big"

To replace multiple characters with one character you need to target the entire pattern. You cannot simply delete a backslash. (Since you have indicated this is a more general problem, I think the answer might lie in modifications to your as yet undescribed method for downloading this text.)
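
The character counts can be checked directly. This short sketch uses only the example string from the question and shows why deleting the backslash alone does not help:

```r
escaped <- 'pretty\\u003D\\u003Ebig'  # as scraped: backslash, u, four hex digits
literal <- 'pretty\u003D\u003Ebig'    # the parser converts the escapes on input

nchar(escaped)  # 21
nchar(literal)  # 11

# Deleting the backslashes leaves the hex digits behind as ordinary text.
gsub("\\\\", "", escaped)  # "prettyu003Du003Ebig"
```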

When I load your functions and the dependencies, this code works:

> freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
> 
> str(freq)
'data.frame':   59 obs. of  4 variables:
 $ Year     : num  1950 1951 1952 1953 1954 ...
 $ Phrase   : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
 $ Frequency: num  1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
 $ Corpus   : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...

(So I guess I am still not clear on the use case.)

edited Jul 20 '13 at 23:46

answered Jul 20 '13 at 17:44

IRTFM

The gory detail of where the download is happening is [here](https://github.com/seancarmody/ngramr/blob/master/R/ngram.R), but just working with `x <- 'pretty\\u003D\\u003Ebig'` allows you to reproduce the problem. – seancarmody Jul 20 '13 at 22:24
I needed to identify the source for multiple missing function errors and loaded RCurl, stringr, httr, and RJSONIO before that would run. I tried with some of your commented code above those functions but am still not sure I have a test case to work on. – IRTFM Jul 20 '13 at 23:45
The relevant call that generates the error is `ngram("pretty=>big")`, but the reference to this package was to show the motivation only: it's too much for a reproducible example! – seancarmody Jul 21 '13 at 06:27
I should add that, having accepted @Hong Ooi's solution, the package no longer gives an error. Previously, a call to `ngram("pretty=>big")` would give `$ Phrase : Factor w/ 1 level "pretty\u003D\u003Ebig"`. – seancarmody Jul 21 '13 at 06:29
Mind you, I am still interested in whether it can be done without `eval` and `parse`. The original example stands: assume we have `x <- 'pretty\\u003D\\u003Ebig'` as a given and need to convert it to `'pretty=>big'`. Rather than the `ngramr` package, you can take the motivating example to be the source of [this page](http://books.google.com/ngrams/graph?content=pretty%3D%3Ebig&year_start=1800&year_end=2000&corpus=15). View the page source and search for `\u003D` and you'll find the problem! – seancarmody Jul 21 '13 at 07:18
