
I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters like:

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode escapes at the end: \u009f, \u0098, etc.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I also tried specifying the unicode characters, but I believe they're getting interpreted as text:

gsub('\u009', '', text) # Unchanged
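
(A small side note, a sketch rather than one of the attempts above: R reads '\u009' as the single code point 0x9, a tab, which is why nothing changes. Spelling out all four hex digits does match the character shown in the printed output, although every stray code point would have to be listed this way; the answers below avoid that.)

gsub('\u009f', '', text) # matches U+009F once the escape is written out in full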

2 Answers


If you want to use regular expressions, you can keep only those characters you want using a range of ASCII codes:

text = "The way I talk to my family......i would get my ass beat to 
DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"

gsub('[^\x20-\x7E]', '', text)

# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"

Below is a table of ASCII codes taken from asciitable.com:

[ASCII table image]

You can see that I am removing any character outside the range \x20 (SPACE) to \x7E (~).
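
If the text spans multiple lines and you want to keep line breaks and tabs, the same idea works with a slightly wider class (a small variation on the pattern above, assuming those are the only control characters worth keeping):

gsub('[^\x20-\x7E\t\r\n]', '', text) # keep printable ASCII plus tab, CR and LF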


The easiest way to get rid of these characters is to convert from UTF-8 to ASCII:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
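
As a quick check on a shortened version of the sample text from the question (a sketch; the variable name and string are just illustrative):

text <- "but they kno I cray cray & just leave it at that \U0001f61c"
iconv(text, 'utf-8', 'ascii', sub = '')
# [1] "but they kno I cray cray & just leave it at that "

Setting sub = 'byte' instead replaces each non-convertible byte with its <xx> hex code, which can be handy for seeing exactly what is being dropped.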