Weird characters appearing in text column in R

Question

I am importing a CSV file exported from Excel into R. One of my columns contains text for each observation, but it ends up showing up like this:

"Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6"

What's going on with the \x8c\xe6? Is there any way to keep only A-Z, a-z, and characters such as .,+/\?*() etc.?

This is due to incorrectly specified encoding. You can specify the encoding when saving from Excel. Save As -> Tools -> Web Options -> Encoding. Probably set this to utf-8. — jbaums, Jun 12 '14 at 00:14

score 3 · Accepted Answer · answered Jun 12 '14 at 00:08

How about removing all non-printable characters with gsub?

a <- "Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6"
gsub("[^[:print:]]","",a)

# [1] "Hey!  Maybe I can give some suggestions: "

The [:print:] class and others are defined on the ?regex help page.
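
As an alternative sketch (not part of the original answer), base R's iconv can drop every byte that has no ASCII equivalent. This assumes the stray bytes can be read as Latin-1, which is only a guess about the file; the variable name cleaned is just for illustration.

```r
# The sample string from the question, with two raw non-ASCII bytes.
a <- "Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6"

# Assumption: the input bytes are Latin-1. sub = "" tells iconv to
# replace anything it cannot convert to ASCII with an empty string.
cleaned <- iconv(a, from = "latin1", to = "ASCII", sub = "")

cleaned
```

Unlike the [:print:] approach, this does not depend on how the current locale classifies characters, but it also discards legitimate accented letters, so it only suits text that is expected to be plain ASCII.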

score 1 · Answer 2 · edited May 23 '17 at 12:27

That's an encoding error; I've gotten those a lot in R (see this encoding table to get a sense of the translation issue). My totally inefficient workaround was to use gsub on the errors I could see, simply deleting them:

gsub('\\x8c\\xe6', '', data)

However, this post may help in detecting the correct encoding: How to detect the right encoding for read.csv?
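
A rough way to test candidate encodings in base R might look like the sketch below. The candidate list is hypothetical and deliberately short; it is not taken from the linked post, and a clean conversion only shows that an encoding is possible, not that it is the right one.

```r
# The sample string from the question, with two raw non-ASCII bytes.
x <- "Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6"

# FALSE means the bytes are not valid UTF-8.
validUTF8(x)

# Hypothetical candidates; extend with whatever encodings are plausible.
candidates <- c("UTF-8", "latin1")

# iconv() returns NA when the input is not valid in the `from` encoding.
converted <- vapply(
  candidates,
  function(enc) iconv(x, from = enc, to = "UTF-8"),
  character(1)
)

# Names of the candidates that converted without error.
names(converted)[!is.na(converted)]
```

Several candidates may convert cleanly, so the converted text still has to be checked by eye. Once an encoding looks right, its name can be passed to read.csv through the fileEncoding argument.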

answered Jun 12 '14 at 00:19

sclarky
