
I'm trying to read a .csv file into R. The .csv file was created in Excel, and it contains "long" dashes, which are the result of Excel "auto-correcting" the sequence space-dash-space. Sample entries that contain these "long" dashes:

US – California – LA
US – Washington – Seattle

I've experimented with different encodings, including the following three options:

x <- read.csv(filename, encoding="windows-1252") # Motivated by http://www.perlmonks.org/?node_id=551123
x <- read.csv(filename, encoding="latin1")
x <- read.csv(filename, encoding="UFT-8")

But the long dashes either show up as � (first and second options) or as <U+0096> (third option).
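Two closely related variants one could sketch (assuming the file really is windows-1252) re-encode the text rather than just tagging it: read.csv's fileEncoding argument converts the input while reading, and iconv converts it up front:

x <- read.csv(filename, fileEncoding="windows-1252") # re-encode while reading
x <- read.csv(text=iconv(readLines(filename), "windows-1252", "UTF-8")) # convert first, then parse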

I realize that I can store the file in a different format or use different software (e.g., export from Excel to CSV with UTF-8 encoding), but that's not the point.

Has anyone figured out what encoding option in R works in such cases?

PMaier
  • For the third option, you can also clean your data afterwards with, for example, `gsub("\u0096", "", x, fixed=TRUE)` – Jaap Oct 21 '15 at 18:08

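Expanding on the cleanup route in the comment above, a rough sketch (assuming the import leaves U+0096 code points in the character columns) that replaces the stray character with a plain hyphen in every character column:

is_chr <- vapply(x, is.character, logical(1)) # which columns hold text
x[is_chr] <- lapply(x[is_chr], function(col) gsub("\u0096", "-", col, fixed = TRUE))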
1 Answer


If you are using RStudio, use Import Dataset with the following settings (an equivalent base-R call is sketched after the list):

  • Heading: No
  • Separator: Whitespace
  • Decimal: Period
  • Quote: Double quote
  • Strings as factors: unchecked
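
Roughly the same settings expressed as a base-R call (a sketch; the code RStudio's importer generates may differ in detail):

mydf <- read.table(filename, header = FALSE, sep = "", dec = ".", quote = "\"", stringsAsFactors = FALSE)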

When your document is loaded, you can simply remove the columns that now show as '?'. In this example, those are columns 2 and 4. If you have a data frame, mydf, you would delete the second column like this:

mydf_new <- mydf[-2]  # drop the second column

You could do the same thing for the other column, which is now column 3.
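Both columns can also be dropped in one step, using the column positions from the example above (a sketch):

mydf_new <- mydf[-c(2, 4)]  # drop original columns 2 and 4 at once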

Scott
  • This workaround is not bad, except for one thing: in the example I provided, there was a fixed number of columns with dashes. In reality, that number is not fixed... and it doesn't really address whether there is an encoding that would avoid such workarounds. – PMaier Oct 21 '15 at 19:14