I have researched this extensively and still cannot find a solution.

I have extracted data from a German Facebook group that looks like

from_ID         from_name           message                                        created_time
12334543        Max Muster          Dies war auch eine sehr sch<U+00F6>ne Bucht    2016-01-08T19:00:54+0000

I understand that <U+00F6> stands for the German umlaut ö. There are many other cases where such Unicode codes replace German umlauts or other language-specific characters, whatever the language.

Whether I want to run a sentiment analysis or just produce a word cloud, these codes sometimes cause problems. For sentiment analysis, the training data does not contain the codes, so the prediction/classification goes wrong. For other text-based procedures, cleaning steps such as stop word removal fail, because stop word lists are also "clean" and do not contain the codes.

Is there an easy way to get rid of these codes and make R display the corresponding character instead?

I have tried a lot. My last resort would be a gsub routine. However, my data frame includes more than 1 million comments, and gsub would be very painful because there seem to be too many Unicode codes to cover (especially once languages other than German are involved).

If I understand correctly, the kind of computer I am using also matters. It is a MacBook Pro.

Any help here is really really appreciated!!

Thank you a lot for your time and help!

rkuebler

1 Answer

It's a bit mystifying, but this will do it:

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht", 
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")

# convert the <U+00xx> format to R's \u00xx escape format for Unicode
message2 <- stringi::stri_replace_all_fixed(message, c("<U+", ">"), c("\\u", ""), vectorize_all = FALSE)

# convert to native through parsing and coercing
as.character(parse(text = shQuote(message2)))
## [1] "Dies war auch eine sehr schöne Bucht" "Schloß Schönbrunn." 
Ken Benoit
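
For a data frame with more than a million comments, parsing every string as R code may be slower and more fragile than necessary. The following is a minimal sketch of an alternative that is not part of the answer above: it decodes the `<U+XXXX>` markers directly with base R. The helper name `decode_angle_unicode` and the variable `comments_raw` are hypothetical, and the sketch assumes every marker holds 4 to 6 hexadecimal digits.

```r
# Hypothetical helper (not from the answer): decode "<U+XXXX>" markers
# into the characters they stand for, using base R only.
decode_angle_unicode <- function(x) {
  # locate every <U+XXXX> marker (4 to 6 hex digits) in each string
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
    # strip the leading "<U+" and the trailing ">" to keep only the hex digits
    hex <- substr(codes, 4, nchar(codes) - 1)
    # hex digits -> integer code point -> UTF-8 character
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

comments_raw <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht",
                  "Schlo<U+00DF> Sch<U+00F6>nbrunn.")
decoded <- decode_angle_unicode(comments_raw)
```

Strings without any marker pass through unchanged. Whether the decoded characters are displayed as ö and ß rather than as codes still depends on the session locale.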
  • Thank you very much for this great advice. I was already thinking that there is a formatting issue. I tried your code. However, it leads to totally different results and unfortunately does not yet fully work. Here is the outcome. Message (before applying your code): "wer kann kurzfristig fr uns einspringen?" Message2 (after applying your code): expression('wer kann kurzfristig fr uns einspringen?' – rkuebler Jan 12 '16 at 11:43
  • Those are only intermediate values, what you want is the evaluation of the third statement. I tried this on Windows 7 and it works fine, with locale set as follows: `> Sys.getlocale("LC_CTYPE") [1] "English_United States.1252"` – Ken Benoit Jan 12 '16 at 17:30
  • Thank you again for your patience and help. In my case Sys.getlocale() returns `[1] "C"`. I guess there is a problem with my settings, right? – rkuebler Jan 16 '16 at 15:02
  • See https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Internationalization-of-the-R_002eapp (item 7) - on OS X you should use that to set your R system encoding to UTF-8, and the problem should be solved. – Ken Benoit Jan 22 '16 at 11:36
  • Thank you a million times! This solved all issues!!! I hope I can one day return someone such a great favor and help! – rkuebler Jan 25 '16 at 14:59
  • Does this solution work [here](http://stackoverflow.com/q/41873359/164148) as well? – hhh Jan 27 '17 at 06:45
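
The locale issue discussed in the comments can be checked from within R. This is a rough sketch, not taken from the thread: it reports the current character locale and tries to switch the running session to UTF-8. The locale name "en_US.UTF-8" is an assumption and must exist on the system. The change lasts only for the current session, and the persistent setting for R.app is the one described in the FAQ linked above.

```r
# Inspect the character-type locale; "C" was the problematic value in the comments.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the session locale is a UTF-8 one.
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

if (!is_utf8) {
  # "en_US.UTF-8" is an assumed locale name, not one given in the thread.
  res <- Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
  # Sys.setlocale() returns an empty string when the request cannot be honoured.
  if (!nzchar(res)) {
    message("Could not switch to en_US.UTF-8; the locale may not be installed.")
  }
}
```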