
I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters like:

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode escapes at the end: \u009f, \u0098, etc.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I also tried specifying the unicode characters, but I believe they're getting interpreted as text:

gsub('\u009', '', text) # Unchanged
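
(A small side note, a sketch rather than one of the attempts above: R reads '\u009' as the single code point 0x9, a tab, which is why nothing changes. Spelling out all four hex digits does match the character shown in the printed output, although every stray code point would have to be listed this way; the answers below avoid that.)

gsub('\u009f', '', text) # matches U+009F once the escape is written out in full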

2 Answers


If you want to use regular expressions, you can keep only those characters you want using a range of ASCII codes:

text = "The way I talk to my family......i would get my ass beat to 
DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"

gsub('[^\x20-\x7E]', '', text)

# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"

Below is a table of ASCII codes taken from asciitable.com:

[ASCII table image]

You can see that I am removing any character outside the range \x20 (SPACE) to \x7E (~).
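
If the text spans multiple lines and you want to keep line breaks and tabs, the same idea works with a slightly wider class (a small variation on the pattern above, assuming those are the only control characters worth keeping):

gsub('[^\x20-\x7E\t\r\n]', '', text) # keep printable ASCII plus tab, CR and LF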


The easiest way to get rid of these characters is to convert from UTF-8 to ASCII:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
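
As a quick check on a shortened version of the sample text from the question (a sketch; the variable name and string are just illustrative):

text <- "but they kno I cray cray & just leave it at that \U0001f61c"
iconv(text, 'utf-8', 'ascii', sub = '')
# [1] "but they kno I cray cray & just leave it at that "

Setting sub = 'byte' instead replaces each non-convertible byte with its <xx> hex code, which can be handy for seeing exactly what is being dropped.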