10

I am trying to change the encoding of a column in a dataframe.

stri_enc_mark(data_updated$text)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

When I try to convert it with enc2utf8(), no error is thrown, but the encoding marks of the vector do not change:

d <- enc2utf8(data_updated$text)
stri_enc_mark(d)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

Any suggestions?

I am on Windows 7, 32bit. Adding data snippet.

> Encoding(data_updated$text[1:35])
 [1] "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"  
 [8] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"  
[15] "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown"
[22] "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown"
[29] "unknown" "UTF-8"   "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"

Data looks like this.

> data_updated$text[1:35]
 [1] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [2] "Deal Talks for Here Mapping Service Expose Reliance on Location Data, via @nytimes #mapping #dilemma  http://t.co/wGdiS5OlRq"                      
 [3] "http://t.co/UZIyX1Rk7W The popping linksexploaded!! http://t.co/KpNntm1dH7 :) http://t.co/oku91uVxZ8"                                              
 [4] "RT @davidsunaria90: Wtch LIVE Mjlis Now\n http://t.co/GXNhe3eY7Y\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/YewOVcz8bb\n…" 
 [5] "Reliance Jio Infocomm: Indian carrier raises $750 million loan for 4G rollout  http://t.co/B2aWlkmwXz"                                             
 [6] "RT @SurjeetInsan: Majlis started in Sirsa Ashram.\nLive @ http://t.co/PR6W5tzZes\nIVR Airtel 55252\nReliance 56300403\n\n#MSGPlsSaveTheEarth"      
 [7] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Techno… http://t.co/kyxTYIxks5"      
 [8] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [9] "RT @jaameinsan: Watch LIVE Majlis Now\n http://t.co/nPQegnLXPa\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/txXMtw3zFP\n#M…" 
[10] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Technology"

These are tweets, and I think the "http://" links are dictating the encoding here, given that they contain strings like "wGdiS5OlRq". For analysis I removed these links using regular expressions, but I need the raw tweets in order to store them in a database. MongoDB has no problem with them, but an RDBMS throws errors.

Frank
NEO
  • It would help to have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It would also be helpful to know what OS you are on and what `Encoding()` returns for those vectors. It's possible that if there are not any non-ascii characters in the string it will just return ASCII. – MrFlick May 14 '15 at 04:58
  • This is a pretty classic example when a problem could be simplified too. You have 36 data points. You need 2 to show off this problem - `data_updated$text[1:2]` would be plenty enough to show nothing changes from ASCII to UTF-8 – thelatemail May 14 '15 at 05:39
  • If the problem really is the RDBMS is throwing errors, then it would be better to describe that problem. The encoding of strings that only have ASCII characters shouldn't cause a problem. – MrFlick May 14 '15 at 06:16
  • The data table I am porting the data to is UTF-8 encoded. Hence I think it does not accept ASCII, the error says, "expected UTF-8" – NEO May 14 '15 at 08:27
  • But something that's ASCII encoded is also UTF-8 encoded. There would be nothing different in the bytes of the two strings. You can't tell the difference. How is this mystery function checking? – MrFlick May 14 '15 at 14:08
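
The point made in the comments can be checked directly. In the data snippet above, the elements marked "UTF-8" ([1], [4], [7], [8], [9]) are exactly the ones that contain the "…" character, which suggests the non-ASCII ellipsis, not the links, is what triggers the mark. The following is a minimal sketch in base R, using two shortened strings modelled on the tweets (the vector names `x` and `y` are made up for illustration, and `validUTF8()` assumes R 3.3.0 or later):

```r
# Two strings modelled on the tweets above: one pure ASCII (with a link),
# one ending in the ellipsis character U+2026.
x <- c("Reliance Jio Infocomm http://t.co/B2aWlkmwXz",
       "via NYT Techno\u2026")

# The ASCII-only string carries no encoding mark; the other is marked UTF-8.
Encoding(x)

# enc2utf8() has nothing to convert for an ASCII-only string ...
y <- enc2utf8(x)
Encoding(y)

# ... so its bytes are unchanged,
identical(charToRaw(x[1]), charToRaw(y[1]))

# and every element is already valid UTF-8.
validUTF8(y)
```

If this holds for the real data too, an unchanged "ASCII" or "unknown" mark is not by itself a sign that the conversion failed.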

3 Answers

9

In case someone is still stuck: I used Encoding() to mark every column as UTF-8.

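  for (col in colnames(mydataframe)) {
    # `Encoding<-` only accepts character vectors, so skip other column types
    if (is.character(mydataframe[[col]])) Encoding(mydataframe[[col]]) <- "UTF-8"
  }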
J.Delannoy
  • I got "Error in `Encoding<-`(`*tmp*`, value = "UTF-8") : a character vector argument expected" with this solution – jcarlos Oct 21 '21 at 15:52
  • Can try this solution to resolve the error: https://stackoverflow.com/questions/33731891/about-the-use-of-encoding-function – kchuying Apr 14 '22 at 13:52
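
The error quoted in the comment above is what `Encoding<-` reports when a column is not a character vector (numeric or factor columns, for example). It is also worth noting that `Encoding<-` only changes the declared mark and leaves the bytes alone, whereas `enc2utf8()` and `iconv()` actually convert. A rough sketch of a converting variant follows; the helper name `to_utf8_columns` and the sample data frame are hypothetical, not part of the answer:

```r
# Hypothetical helper: convert every character column to UTF-8, leave the rest alone.
to_utf8_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# Build "café" as Latin-1 bytes (0xe9 is e-acute in Latin-1) and declare it as such.
latin1 <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(latin1) <- "latin1"

df <- data.frame(id = 1:2, text = c("abc", latin1), stringsAsFactors = FALSE)
out <- to_utf8_columns(df)

Encoding(out$text)        # the ASCII-only element stays "unknown"
charToRaw(out$text[2])    # e-acute is now the two bytes c3 a9
```

Strings whose mark is "unknown" are interpreted in the native encoding by `enc2utf8()`, so on Windows the result depends on the session's code page; `iconv(x, from = ..., to = "UTF-8")` is an option when the source encoding is known.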
2

It appears that we can use the iconv() function to convert the encoding after converting the vector to a factor and then back to a character vector. It is a bit strange, to be honest.

NEO
1

I found stringi::stri_enc_toascii() pretty useful; it solved my problem.

I posted a case in "How to handle example data in R Package that has UTF-8 marked strings".
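
A short sketch of what that call does, assuming the stringi package is installed (the sample strings are shortened versions of the tweets in the question). Note that the conversion is lossy: characters outside ASCII, such as the trailing "…", are replaced by a substitute character, so it may not suit the case where the raw tweets have to be stored unchanged.

```r
library(stringi)

# One ASCII-only string and one ending in the ellipsis character U+2026.
x <- c("Reliance Jio Infocomm http://t.co/B2aWlkmwXz",
       "via NYT Techno\u2026")

stri_enc_mark(x)             # marks before conversion

ascii <- stri_enc_toascii(x) # non-ASCII code points are substituted
stri_enc_mark(ascii)         # marks after conversion
stri_enc_isascii(ascii)      # TRUE where a string holds only ASCII bytes
```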

Shixiang Wang