2

I have a large CSV file (8.1 GB) that I'm trying to wrangle into R. I created the CSV using Python's csvkit in2csv, converted from a .txt file, but somehow the conversion led to null characters showing up in the file. I'm now getting this error when importing:

Error in fread("file.csv", nrows = 100) : embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0'

I can import small chunks just fine with read.csv, but only because it accepts UTF-16 input via the fileEncoding argument.

test <- read.csv("file.csv", nrows=100, fileEncoding="UTF-16LE")

I don't dare try to import an 8 GB file with read.csv, though.

So I then tried the solution offered here, in which you use sed s/\\0//g file.csv > file2.csv to pull the nulls out. The command performed just fine and populated a new 8GB CSV file, but I received a nearly-identical error:

Error in fread("file2.csv", nrows = 100) : embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0,\0p\0o\0s\0t\0_\0z\0i

So, that didn't work. I'm stumped at this point. Considering the size of the file, I can't use read.csv on the whole thing, and I'm not sure how to get rid of the nulls in the original CSV. I'm not even sure how the file got encoded as UTF-16. Any suggestions or advice would be greatly appreciated at this point.

Edit: I'm on a Windows machine.

AnnotBib
  • Unless you're never going to get this data feed again through the same process, it seems worthwhile to fix the data at the source :-( ... but that is a different question. Good luck! – shellter Nov 22 '14 at 13:28

3 Answers

3

If you're on Linux/Mac, try this:

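library(data.table)
file <- "file.csv"
tt <- tempfile()  # or tempfile(tmpdir="/dev/shm")
# Strip NUL bytes with tr and write the cleaned copy to the temp file
system(paste0("tr -d '\\000' < ", shQuote(file), " > ", shQuote(tt)))
fread(tt)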
GSee
  • Unfortunately I'm on Windows (should have specified that earlier, apologies); I do, however, have GnuWin32 installed. That's how I was able to use `sed` before. Is there an equivalent I could run? – AnnotBib Nov 21 '14 at 20:29
0

A possible option would be to install a bash emulator on your machine from http://win-bash.sourceforge.net/ and remove the null characters using Linux tools, as described, for example, in "Identifying and removing null characters in UNIX" or in "'Embedded nul in string' error when importing csv with fread".
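
One way this could look with such tools, as a rough sketch under assumptions: the `ÿþ` at the start of the error message corresponds to the bytes FF FE, the UTF-16 little-endian byte-order mark, so the NULs are the high bytes of UTF-16 text, and re-encoding the file avoids deleting bytes by hand. The sample file names below are hypothetical stand-ins for file.csv, and having iconv available (in the emulator or through GnuWin32's libiconv) is an assumption.

```shell
# Build a tiny UTF-16LE sample with a BOM so the sketch runs on its own.
# (Hypothetical stand-in for the real 8 GB file.csv.)
printf '\377\376r\000e\000c\000,\000z\000i\000p\000\n\0001\000,\0002\000\n\000' > sample_utf16.csv

# Re-encode rather than delete bytes. With "-f UTF-16", iconv should use the
# BOM to pick the byte order.
iconv -f UTF-16 -t UTF-8 sample_utf16.csv > sample_utf8.csv

# Count the NUL bytes left in the converted file (none are expected).
tr -cd '\000' < sample_utf8.csv | wc -c
```

For the real file the same idea would be `iconv -f UTF-16 -t UTF-8 file.csv > file_utf8.csv`, after which fread("file_utf8.csv") should no longer hit embedded NULs. Deleting NULs with tr alone would still leave the two BOM bytes at the start of the first column name.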

Nikita Barsukov
0

I think the nonsensical characters happen because the file is compressed. This is what I found when trying to read vcf.gz files. fread does not seem to support reading compressed files. See e.g. https://github.com/Rdatatable/data.table/issues/717

readLines() and read.table() support compressed files, but they are slower.
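
To check whether a file really is gzip-compressed before handing it to fread, and to decompress it if so, something like the following may help. This is only a minimal sketch: the sample file names are hypothetical, and having gzip and od on the PATH is an assumption.

```shell
# Create a small compressed sample so the sketch runs on its own.
# (Hypothetical stand-in for a real .gz input.)
printf 'a,b\n1,2\n' > sample.csv
gzip -c sample.csv > sample.csv.gz

# Inspect the first two bytes: gzip data starts with 1f 8b.
od -An -tx1 -N2 sample.csv.gz

# Decompress to a plain file that fread can open directly.
gzip -dc sample.csv.gz > sample_plain.csv
```

In R, fread("sample_plain.csv") would then read the decompressed copy. By contrast, the `ÿþ` prefix in the question's error message corresponds to the bytes ff fe, the UTF-16LE byte-order mark, which points to an encoding issue rather than compression.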

CoderGuy123