I have a character vector in UTF-8 and need to 1) replace special symbols with gsub, and 2) convert the whole data frame to a "normal" (ASCII?) encoding. However, gsub fails on it: the pattern never matches the string. I work in a French locale (I also tried a UTF-8 one), but that did not help. I cannot share the full dataset, but I will post a couple of strings from it here.

  DataFrame <- read.csv("SLD_products_FullData.csv", header = FALSE, sep = ",", encoding = "UTF-8")
  title <- DataFrame$title

This is how title looks in the console:

"Campbell’s Gravy and General Mills • Cheerios"

And in the Viewer:

Campbell’s Gravy and General Mills • Cheerios

I tried (with various combinations of perl, fixed, etc.):

gsub("’","'",title)
gsub("•","-",title)

Even tried gsub("’","'",title). No luck.

Encoding(title)
[1] "UTF-8"

Any suggestions? Thanks!!

Alexey Ferapontov
  • Your gsub command works for me. – Avinash Raj Mar 20 '15 at 13:45
  • Hmm... when I copy/paste the above text it does indeed work. But not in the original code/file encoding. – Alexey Ferapontov Mar 20 '15 at 13:51
  • Your `gsub` command works for me too, however, [this SO answer](http://stackoverflow.com/questions/28976569/why-r-gsub-or-regexp-for-punctuation-doesnt-get-all-punctuation) shows how to specify UTF encoding in a regex using `gsub`, so maybe that will help resolve your issue. – eipi10 Mar 20 '15 at 15:23
  • Thanks. Are you referring to `UCP` etc? I tried: `> title[8] [1] "Campbell’s Gravy" > gsub("(*UCP)(*UTF)’","--",title[8]) [1] "Campbell’s Gravy"` – Alexey Ferapontov Mar 20 '15 at 15:54
  • One more comment: copy/pasting from this page works, as the encoding becomes correct. The original setup doesn't work. Any suggestions on how I can post the file with a sample string here? (I cannot post the full file for data copyright reasons.) – Alexey Ferapontov Mar 20 '15 at 15:57
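
Since pasted text matches but the original data does not, it may help to compare what is actually stored in each string rather than how it is displayed. A minimal sketch, assuming the sample string `x` below stands in for `title[8]` (the variable names are illustrative only):

```r
# Hypothetical stand-in for title[8]; replace with the real element
x <- "Campbell\u2019s Gravy"

# Code points of each character; U+2019 (right single quote) is 8217
cps <- utf8ToInt(x)
print(cps)

# Raw bytes as stored; a UTF-8 encoded U+2019 is the three bytes e2 80 99
print(charToRaw(x))

# TRUE only if the string really contains U+2019
has_quote <- 8217L %in% cps
print(has_quote)
```

If the real data element gives different code points or bytes at the position of the apostrophe, the stored character is not the one typed in the gsub pattern, which would explain why nothing matches.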

1 Answer

title <- "Campbell’s Gravy and General Mills • Cheerios"
Encoding(title) 
#In your case it should be UTF-8

#iconvlist() to list possible encodings
#iconv to change encoding from latin1 to UTF-8
title <- iconv(title, "latin1", "UTF-8")
Encoding(title)
[1] "UTF-8"

#Apply Native encoding on the vector    
title <- enc2native(title)
Encoding(title)
[1] "latin1"

#Apply UTF8 encoding on the vector  
title <- enc2utf8(title)
Encoding(title)
[1] "UTF-8"

# Force a change in the encoding
Encoding(title) <- "latin1"
Encoding(title)
[1] "latin1"

title <- enc2utf8(title)
title2 <- gsub("•","-",title)
title2 <- gsub("’","'",title2)
title2
[1] "Campbell's Gravy and General Mills - Cheerios"

Encoding(title2)
[1] "unknown"

title2 <- gsub("(*UCP)(*UTF)•","--",title, perl= TRUE)
title2 <- gsub("(*UCP)(*UTF)’","'",title2, perl= TRUE)
title2
[1] "Campbell's Gravy and General Mills -- Cheerios"
Encoding(title2)
[1] "unknown"
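
The patterns above contain literal non-ASCII characters, so whether they match can depend on how the script file itself is saved. One way to sidestep that, offered as a sketch under that assumption and not as a confirmed fix for the original file, is to write the symbols as Unicode escapes (`\u2019` for ’ and `\u2022` for •):

```r
# Build the sample string from escapes so the script itself stays pure ASCII
title <- "Campbell\u2019s Gravy and General Mills \u2022 Cheerios"

# fixed = TRUE treats each pattern as a literal string, not a regex
title2 <- gsub("\u2019", "'", title, fixed = TRUE)
title2 <- gsub("\u2022", "-", title2, fixed = TRUE)
print(title2)

# The result is plain ASCII, so no encoding mark is needed
print(Encoding(title2))
```

With the sample string this yields `"Campbell's Gravy and General Mills - Cheerios"`, and `Encoding(title2)` reports `"unknown"` because the result contains only ASCII characters.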

"Unknown" is any other encoding than latin1, UTF-8, or bytes. See

Kvasir EnDevenir
  • I tried. Here's the part of the code: `print(title[8]) print(Encoding(title[8])) title2 <- gsub("’","-",title[8]) print(title2) ` result is: `[1] "Campbell’s Gravy" [1] "UTF-8" [1] "Campbell’s Gravy"` – Alexey Ferapontov Mar 20 '15 at 19:03
  • P.S. how do I put caret end in a comment so the code in the comment looks like one in an answer? – Alexey Ferapontov Mar 20 '15 at 19:04
  • Maybe: `title2 <- gsub(enc2utf8("’"), "'", title2)` or `title2 <- gsub("(*UCP)(*UTF)’", "'", title2, perl = TRUE)`. Sorry I can't help further. Everything works perfectly here. – Kvasir EnDevenir Mar 20 '15 at 19:21
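
For the second half of the question, converting the whole data frame to ASCII, one possible sketch is to apply the replacements to every character column and then pass the result through `iconv()`. The data frame `df`, its columns, and the helper `to_ascii` are hypothetical stand-ins, not the asker's data:

```r
# Hypothetical stand-in for the data frame read from the CSV
df <- data.frame(
  title = c("Campbell\u2019s Gravy", "General Mills \u2022 Cheerios"),
  price = c(1.5, 3.2),
  stringsAsFactors = FALSE
)

# Map known symbols to ASCII first; iconv() then replaces each byte of
# any remaining non-ASCII character with "?" so leftovers are easy to spot
to_ascii <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  x <- gsub("\u2022", "-", x, fixed = TRUE)
  iconv(x, from = "UTF-8", to = "ASCII", sub = "?")
}

# Apply the helper to character columns only, leaving numeric ones alone
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], to_ascii)
print(df$title)
```

Handling the known symbols explicitly before calling `iconv()` keeps the output predictable, since transliteration of characters such as • varies between platforms.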