
I am importing a CSV file, exported from Excel, into R. One of my columns contains text for each observation, but the text shows up like this:

"Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6" 

What's going on with the \x8c\xe6? Is there any way to clean the text so that it contains only A-Z, a-z, and characters such as .,+/\?*() etc.?

hrbrmstr
theamateurdataanalyst
This is due to incorrectly specified encoding. You can specify the encoding when saving from Excel: Save As -> Tools -> Web Options -> Encoding. Probably set this to UTF-8. – jbaums Jun 12 '14 at 00:14
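On the R side, the matching step is to tell read.csv which encoding the file uses. The sketch below is only an illustration: the temporary file and its one-row content are made up for the example, and it assumes the CSV really was saved as UTF-8. The `encoding` argument only marks the strings as UTF-8; if the file is in some other encoding, `fileEncoding` asks R to re-encode it while reading.

```r
# Build a tiny UTF-8 CSV byte-for-byte (hypothetical data: one column, one row "café").
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("text\ncaf"), as.raw(c(0xc3, 0xa9)), charToRaw("\n")), tmp)

# encoding = "UTF-8" marks the strings that are read as UTF-8; it does not re-encode them.
dat <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
dat$text

unlink(tmp)  # remove the temporary file
```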

2 Answers


How about removing all non-printable characters with gsub?

a <- "Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6"
gsub("[^[:print:]]","",a)

# [1] "Hey!  Maybe I can give some suggestions: "

The [:print:] class and others are defined on the ?regex help page.
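Depending on the R version and locale, a regular expression applied to a string that is not valid in the current encoding may stop with an error instead of returning the cleaned text. A byte-wise variant, sketched here as an assumption rather than as part of the original answer, sidesteps that by keeping only the printable ASCII range (space through tilde). Note that it also drops tabs and newlines.

```r
a <- "Hey! \x8c\xe6 Maybe I can give some suggestions: \x8c\xe6"

# useBytes = TRUE matches byte by byte, so invalid multibyte sequences are not a problem.
# "[^ -~]" means: any byte outside the range from space (0x20) to tilde (0x7e).
clean <- gsub("[^ -~]", "", a, useBytes = TRUE)
clean
```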

MrFlick

That's an encoding error; I've run into those a lot in R (see this encoding table to get a sense of the translation issue). My admittedly inefficient workaround was to use 'gsub' on the errors I could see, simply deleting them:

gsub('\\x8c\\xe6', '', data)

However, this post may help in detecting the correct encoding: How to detect the right encoding for read.csv?
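As a rough first check (not necessarily the approach in the linked post), base R can at least tell you which strings are not valid UTF-8, and iconv can then re-encode them from a guessed source encoding. The sample vector and the choice of "latin1" below are assumptions for illustration; the real source encoding has to be confirmed by looking at the converted text.

```r
# Hypothetical sample: plain ASCII, valid UTF-8 bytes, and the problem string.
x <- c("plain ascii", "caf\xc3\xa9", "Hey! \x8c\xe6 Maybe")

# validUTF8() flags strings whose bytes do not form valid UTF-8.
ok <- validUTF8(x)

# Re-encode only the invalid strings, guessing that they are Latin-1.
fixed <- x
fixed[!ok] <- iconv(x[!ok], from = "latin1", to = "UTF-8")

ok
validUTF8(fixed)
```

If the converted text still looks wrong, the guess was wrong; try another entry from iconvlist().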

Community
sclarky