13

After reading all about iconv and Encoding, I am still confused.

I am scraping the source of a web page, and I have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'.

More simply, if I set

x <- 'pretty\\u003D\\u003Ebig'

How do I perform a conversion on x to yield pretty=>big?

Any suggestions?

smci
seancarmody

7 Answers

12

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1
hadley
  • 2
    `as.character(x2)` will work too and will be vectorized (i.e.: `as.character(parse(text=paste0("'", rep(x1,3), "'")))`). Also `shQuote(x1)` could be handy instead of `paste0`. – Marek Apr 23 '14 at 21:00
8

With the stringi package:

> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"
Stéphane Laurent
4

Although I have accepted Hong Ooi's answer, I can't help thinking that parse and eval is a heavyweight solution. Also, as pointed out, it is not secure, although for my application I can be confident that I will not get dangerous quotes.

So, I have devised an alternative, somewhat brutal, approach:

udecode <- function(string){
  uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
  ufilter <- function(string) {
    if (substr(string, 1, 1)=="|") uconv(substr(string, 2, 5)) else string
  }
  string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl=TRUE)
  strings <- unlist(strsplit(string, ","))
  string <- paste(sapply(strings, ufilter), collapse='')
  return(string)
}

Any simplifications welcomed!
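
One possible simplification, offered as a sketch rather than a definitive replacement: because udecode uses "," as a separator, commas in the input are dropped when the pieces are pasted back together. The variant below (the name udecode2 is hypothetical) rewrites the matches in place with base R's gregexpr and regmatches, so no separator is needed. Like the original, it assumes only four-digit \uXXXX escapes and does not handle surrogate pairs.

```r
# Hypothetical alternative to udecode(): decode \uXXXX escapes in place.
udecode2 <- function(string) {
  # Locate every literal backslash-u followed by four hex digits
  m <- gregexpr("\\\\u[[:xdigit:]]{4}", string)
  # Replace each match with the character for its code point
  regmatches(string, m) <- lapply(regmatches(string, m), function(escapes) {
    vapply(escapes,
           function(e) intToUtf8(strtoi(substr(e, 3, 6), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  string
}

udecode2("pretty\\u003D\\u003Ebig")
udecode2(c("a,b\\u003Dc", "plain"))  # vectorised; commas are left alone
```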

seancarmody
2

A use for eval(parse)!

eval(parse(text=paste0("'", x, "'")))

This has its own problems of course, such as having to manually escape any quote marks within the string. But it should work for any valid Unicode sequences that may appear.
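
A small sketch of the quoting problem mentioned above, using a made-up input (x_bad is hypothetical, not from the question). Wrapping in single quotes fails as soon as the data contains a single quote; switching the wrapper to double quotes helps only when the data contains no double quotes, so this is a workaround and not a general fix.

```r
x_bad <- "it's\\u003Dbig"   # hypothetical input containing a single quote

# Wrapping in single quotes produces invalid R code, so parse() errors
res <- tryCatch(
  eval(parse(text = paste0("'", x_bad, "'"))),
  error = function(e) "parse error"
)
res

# Wrapping in double quotes works for this particular input
ok <- eval(parse(text = paste0('"', x_bad, '"')))
ok
```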

smci
Hong Ooi
1

I sympathise; I have struggled with R and unicode text in the past and not always successfully. If your data is in x then first try a global replace, something like this:

x <- gsub("\u003D", "=>", x)

I sometimes use a construction like

lapply(x, utf8ToInt)

to see where the high code points are e.g. anything over 150. This helps me locate problems caused by non-breaking spaces, for example, which seem to pop up every now and again.
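
As a rough illustration of that inspection step, here is a sketch on a made-up string (s is hypothetical) that contains a non-breaking space. The threshold of 127 is an assumption chosen to flag anything outside ASCII; the cut-off of 150 mentioned above would work the same way.

```r
s <- "price:\u00a0100"   # hypothetical example with a non-breaking space (U+00A0)
cp <- utf8ToInt(s)       # integer code point of each character
high <- which(cp > 127)  # positions of non-ASCII characters
high
cp[high]
```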

SlowLearner
  • 1
    I'd need to use `x <- gsub("\u003D\u003E", "=>", x)` but I'd rather not cover every possible case. Note that `lapply(x, utf8ToInt)` yields `112 114 101 116 116 121 92 117 48 48 51 68 92 117 48 48 51 69 98 105 103` so that they are all low code points! I just need to undo the \u escaping! – seancarmody Jul 20 '13 at 12:17
  • 1
    Ah, good point, I am usually tackling Asian scripts which do yield high points, this is clearly a bit different. – SlowLearner Jul 20 '13 at 12:21
1
> iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
[1] "pretty=>big"

but you appear to have an extra escape

user1609452
  • Exactly! I can't help the extra escape: that's what the [web-page](http://books.google.com/ngrams/graph?content=pretty%3D%3Ebig&year_start=1800&year_end=2000&corpus=15) returns (click on it, view source and scroll down to the data). – seancarmody Jul 20 '13 at 14:18
  • It's also instructive to compare `cat('pretty\u003D\u003Ebig')` and `cat('pretty\\u003D\\u003Ebig')`. There is a big difference between escapes in data you enter at the console and escapes in data you obtain through other means. – seancarmody Jul 20 '13 at 14:19
  • Are you using Windows? – user1609452 Jul 20 '13 at 14:38
1

The trick here is that '\\u003D' is actually 6 characters while you want '\u003D' which is only one character. The further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:

gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
#[1] "pretty=>big"

To replace multiple characters with one character you need to target the entire pattern. You cannot simply delete a backslash. (Since you have indicated this is a more general problem, I think the answer might lie in modifications to your as yet undescribed method for downloading this text.)
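
The character counts mentioned above can be checked directly with nchar. This is only a minimal sketch; the variable names escaped and literal are made up for illustration.

```r
escaped <- "\\u003D"  # backslash, u, 0, 0, 3, D: the text as scraped
literal <- "\u003D"   # the R parser turns this escape into a single "="

n_escaped <- nchar(escaped)
n_literal <- nchar(literal)
c(n_escaped, n_literal)
identical(literal, "=")
```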

When I load your functions and the dependencies, this code works:

> freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
> 
> str(freq)
'data.frame':   59 obs. of  4 variables:
 $ Year     : num  1950 1951 1952 1953 1954 ...
 $ Phrase   : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
 $ Frequency: num  1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
 $ Corpus   : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...

(So I guess I am still not clear on the use case.)

IRTFM
  • The gory detail of where the download is happening is [here](https://github.com/seancarmody/ngramr/blob/master/R/ngram.R), but just working with `x <- 'pretty\\u003D\\u003Ebig'` allows you to reproduce the problem. – seancarmody Jul 20 '13 at 22:24
  • I needed to identify the source for multiple missing function errors and loaded RCurl, stringr, httr, and RJSONIO before that would run. I tried with some of your commented code above those functions but am still not sure I have a test case to work on. – IRTFM Jul 20 '13 at 23:45
  • The relevant call that generates the error is `ngram("pretty=>big")`, but the reference to this package was to show the motivation only: it's too much for a reproducible example! – seancarmody Jul 21 '13 at 06:27
  • I should add that having accepted @Hong Ooi's solution, the package no longer gives an error. Previously, a call to `ngram("pretty=>big")` would give `$ Phrase : Factor w/ 1 level "pretty\u003D\u003Ebig"`. – seancarmody Jul 21 '13 at 06:29
  • Mind you, I am still interested in whether it can be done without `eval` and `parse`. The original example stands: assuming we have `x <- 'pretty\\u003D\\u003Ebig'` as a given and need to convert it to `'pretty=>big'`. Rather than the `ngramr` package, you can take the motivating example to be in the source of [this page](http://books.google.com/ngrams/graph?content=pretty%3D%3Ebig&year_start=1800&year_end=2000&corpus=15). View the page source and search for `\u003D` and you'll find the problem! – seancarmody Jul 21 '13 at 07:18