
I'm trying to read a .csv file into R. The .csv file was created in Excel, and it contains "long" dashes, which are the result of Excel "auto-correcting" the sequence space-dash-space. Sample entries that contain these "long" dashes:

US – California – LA
US – Washington – Seattle

I've experimented with different encodings, including the following three options:

x <- read.csv(filename, encoding="windows-1252") # Motivated by http://www.perlmonks.org/?node_id=551123
x <- read.csv(filename, encoding="latin1")
x <- read.csv(filename, encoding="UFT-8")

But the long dashes either show up as � (first and second options) or as <U+0096> (third option).
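Two closely related variants one could sketch (assuming the file really is windows-1252) re-encode the text rather than just tagging it: read.csv's fileEncoding argument converts the input while reading, and iconv converts it up front:

x <- read.csv(filename, fileEncoding="windows-1252") # re-encode while reading
x <- read.csv(text=iconv(readLines(filename), "windows-1252", "UTF-8")) # convert first, then parse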

I realize that I can store the file in a different format or use different software (e.g., export from Excel to CSV with UTF-8 encoding), but that's not the point.

Has anyone figured out what encoding option in R works in such cases?

PMaier
  • For the third option, you can also clean your data afterwards with, for example, `gsub("\u0096", "", x, fixed=TRUE)` – Jaap Oct 21 '15 at 18:08

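Expanding on the cleanup route in the comment above, a rough sketch (assuming the import leaves U+0096 code points in the character columns) that replaces the stray character with a plain hyphen in every character column:

is_chr <- vapply(x, is.character, logical(1)) # which columns hold text
x[is_chr] <- lapply(x[is_chr], function(col) gsub("\u0096", "-", col, fixed = TRUE))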
1 Answer


If you are using RStudio, use Import Dataset with the following settings (an equivalent base-R call is sketched after the list):

  • Heading: No
  • Separator: Whitespace
  • Decimal: Period
  • Quote: Double quote
  • Strings as factors: unchecked
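
Roughly the same settings expressed as a base-R call (a sketch; the code RStudio's importer generates may differ in detail):

mydf <- read.table(filename, header = FALSE, sep = "", dec = ".", quote = "\"", stringsAsFactors = FALSE)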

When your document is loaded, you can simply remove the columns that now show as '?'. In this example, those are columns 2 and 4. If you have a data frame, mydf, you would delete the second column like this:

mydf_new <- mydf[-2]  # drop the second column

You could do the same thing for the other column, which is now column 3.
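Both columns can also be dropped in one step, using the column positions from the example above (a sketch):

mydf_new <- mydf[-c(2, 4)]  # drop original columns 2 and 4 at once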

Scott
  • This workaround is not bad, except for one thing: in the example I provided, there was a fixed number of columns with dashes. In reality, that number is not fixed... and it doesn't really address whether there is an encoding that would avoid such workarounds. – PMaier Oct 21 '15 at 19:14