I'm having trouble reading in data from a file with unusual symbols; there is no error message but it stops once it hits a line with a specific symbol.

temp = read.csv(filePaths[i], header=TRUE, sep="\t", comment.char="#")

The last field which is read in is

Familial Non-VHL Clear Cell Renal Cancer;Birt-Hogg-Dub

Reading the file in Excel, this actually reads:

Familial Non-VHL Clear Cell Renal Cancer;Birt-Hogg-Dub-> Syndrome

but the "->" is a symbol; I believe this actually is "Birt–Hogg–Dubé syndrome", and the last character is probably being interpreted as an EOF char.

I only have this problem on Windows.

I've tried different encoding settings (encoding = "UTF-8", encoding = "bytes", and fileEncoding = "UTF-8") without any difference. I've looked at the question "Cannot read unicode .csv into R" and searched elsewhere, but can't easily find an answer. Note that I probably can't use a specific language encoding. Thanks!

-- Update -- Created a file with one column, a header, 3 entries (problematic entry at #2), found here: https://www.dropbox.com/s/3m2wak8rhyab6j2/test.txt?dl=0

Dave Liu
  • Can you post the file (or the relevant part of it) to make your problem reproducible? – lukeA Mar 20 '16 at 17:38
  • Just did, thanks for any help! – Dave Liu Mar 20 '16 at 21:22
  • I see. `\032` is causing problems. `download.file("https://www.dropbox.com/s/3m2wak8rhyab6j2/test.txt?dl=1", tf <- tempfile(fileext = ".csv"));library(stringi);read.csv(text=stri_read_lines(tf), header=T)` will load it. – lukeA Mar 20 '16 at 21:38
  • I saw this answered here: http://r.789695.n4.nabble.com/End-of-line-marker-td1577062.html – statsNoob Apr 19 '16 at 21:04
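
The `\032` byte mentioned in the comments is 0x1A (Ctrl-Z), which Windows text-mode reads can treat as an end-of-file marker. As a rough base-R sketch of one way around it, the file can be read in binary mode, the 0x1A bytes dropped, and the remaining text parsed. The test file contents and the variable names below are made up for illustration, and dropping the byte assumes it carries no information you need:

```r
# Build a small test file containing a raw 0x1A (octal \032, Ctrl-Z) byte,
# mimicking the problematic entry. The contents are hypothetical.
tf <- tempfile(fileext = ".txt")
bytes <- c(charToRaw("Disease\nfirst entry\nBirt-Hogg-Dub"),
           as.raw(0x1a),
           charToRaw(" Syndrome\nthird entry\n"))
writeBin(bytes, tf)

# Read the whole file as raw bytes so that no byte is treated as end-of-file.
raw_bytes <- readBin(tf, what = "raw", n = file.info(tf)$size)

# Drop every 0x1A byte (assumption: it is safe to discard).
clean <- raw_bytes[raw_bytes != as.raw(0x1a)]

# Parse the cleaned text with the same options as the original call.
temp <- read.csv(text = rawToChar(clean), header = TRUE, sep = "\t",
                 comment.char = "#")
```

For a real file, `tf` would be replaced with the path in `filePaths[i]`; whether the byte should be dropped or replaced with the intended character depends on how the file was produced.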

1 Answer

As you've guessed, you need to change your file encoding when reading your file.

  1. Determine what file encoding you have

  2. Use read.table specifying that file encoding via fileEncoding
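
The two steps above might look roughly like the following in base R. This is only a sketch: the test file is made up, `validUTF8()` is a crude check rather than a real encoding detector, and "CP1252" is an assumption (it is what Notepad's "ANSI" label usually means on Western-language Windows systems, but your file may differ):

```r
# Write a small Windows-1252 encoded file for illustration (hypothetical data).
tf <- tempfile(fileext = ".txt")
txt <- "Disease\nBirt-Hogg-Dub\u00e9 syndrome\n"
writeBin(iconv(txt, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]], tf)

# Step 1 (rough check): lines that are not valid UTF-8 suggest the file
# uses some other encoding, often a single-byte one such as CP1252.
lines_raw <- readLines(tf, warn = FALSE)
is_utf8 <- validUTF8(lines_raw)

# Step 2: tell read.table which encoding the file is in, so it can
# re-encode the text to the session's native encoding while reading.
temp <- read.table(tf, header = TRUE, sep = "\t",
                   fileEncoding = "CP1252", stringsAsFactors = FALSE)
```

Note that this only addresses mis-decoded characters; it would not by itself help if reading stops because of a control byte in the file.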

csgillespie
  • Thanks. Following the suggestion, I opened the file in Notepad and found that the encoding was "ANSI". When I tried fileEncoding = "ANSI", it gave "Unsupported conversion from 'ANSI' to ''". I then saved the file with the encoding changed from "ANSI" to "UTF-8", but I am still encountering errors when using fileEncoding = "UTF-8". – Dave Liu Mar 20 '16 at 21:32
  • When you opened it in Notepad, was it properly displayed? – csgillespie Mar 20 '16 at 21:48
  • Yes, it displayed all data (past the problematic character) if that is what you mean. – Dave Liu Mar 21 '16 at 01:00