
I was reading in a csv file.

Code is:

mydata = read.csv("mycsv.csv", header=TRUE, sep=",", quote="\"")

Get the following warning:

Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input

Now some cells in my CSV have missing values that are represented by "".

How do I write this code so that I do not get the above warning?

user1172468
  • Does this opencsv bug report (http://sourceforge.net/p/opencsv/bugs/96/) look like it might have led to your CSV file having nulls? If that's not it and you're on a Linux system, `tr -d '\000' < filein > fileout` will remove the nulls, but that might not fully fix your issue. – hrbrmstr Apr 22 '14 at 02:57
  • MMMmm I'll check ... good find – user1172468 Apr 22 '14 at 02:59

7 Answers


Your CSV might be encoded in UTF-16. This isn't uncommon when working with some Windows-based tools.

You can try loading a UTF-16 CSV like this:

read.csv("mycsv.csv", ..., fileEncoding="UTF-16LE")
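
If you are not sure whether the file really is UTF-16, one way to check is to look at its first two bytes for a byte-order mark (FF FE for UTF-16LE); a missing BOM does not rule UTF-16 out, but a present one confirms it. A minimal sketch, assuming the question's "mycsv.csv" filename:

# Peek at the first two bytes; a UTF-16LE byte-order mark is FF FE
first_bytes <- readBin("mycsv.csv", what = "raw", n = 2)
if (identical(first_bytes, as.raw(c(0xff, 0xfe)))) {
  mydata <- read.csv("mycsv.csv", fileEncoding = "UTF-16LE")
}
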
nneonneo
  • Thanks but that didn't fix it ... I am pretty sure I'm not dealing with a UTF-16LE file – user1172468 Apr 22 '14 at 02:49
  • @user1172468: Have you tried looking at the file in a hexeditor? There might be embedded NULs, I guess. What program generated your CSVs? – nneonneo Apr 22 '14 at 02:50
  • I got the following: Warning messages: 1: In read.table(file = file, header = header, sep = sep, quote = quote, : invalid input found on input connection 'mycsv.csv' 2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'mycsv.csv' – user1172468 Apr 22 '14 at 02:51
  • I generated it using opencsv in Java -- I'm pretty confident that there are no utf-16 characters in the file -- but I could always be wrong – user1172468 Apr 22 '14 at 02:52
  • I don't know what "UTF-16LE" is, but it helped me!! – Kuber Jan 06 '16 at 11:59

You can try using the skipNul = TRUE option.

mydata = read.csv("mycsv.csv", quote = "\"", skipNul = TRUE)

From ?read.csv

Embedded nuls in the input stream will terminate the field currently being read, with a warning once per call to scan. Setting skipNul = TRUE causes them to be ignored.

It worked for me.
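
If you want to see how many embedded nuls the file actually contains, and where they are, a minimal sketch that reads the raw bytes (the question's "mycsv.csv" filename is assumed):

# Read the whole file as raw bytes and count/locate NUL (0x00) bytes
raw_bytes <- readBin("mycsv.csv", what = "raw", n = file.size("mycsv.csv"))
sum(raw_bytes == as.raw(0))     # how many embedded nuls there are
which(raw_bytes == as.raw(0))   # their byte offsets in the file
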

Apex Chimps
  • @Richard, @Apex, or someone, will you please point me to a resource or 1) define an "embedded nul" and 2) explain in more detail what `skipNul = TRUE` does? Thanks. – Daniel Fletcher Apr 24 '16 at 03:01
  • Null is ASCII value `0` (0x00), called `NUL` or `null` (check any ASCII table). A (manipulated or converted) string can contain these characters. Sometimes they are rendered as `\0`, as in `ABC\0EFG`. skipNul = TRUE ignores them. – Enzo Apr 25 '16 at 07:23
  • @Enzo thanks for the feedback. I'm inferring R thinks embedded nulls should become `NA`s; however, because the argument for `na.strings = ` is unclear or insufficient to convert _all_ embedded nulls to `NA`s, R leaves the remainders as text strings with their source value. Correct? If so, is there any way to determine which data points R ignored as embedded nulls? (The goal being to convert them to `NA`s.) Thanks, again. – Daniel Fletcher Apr 25 '16 at 18:23
  • In my case (the output from a machine) skipNul was fine. – JASC Sep 03 '19 at 22:36
  • But with the output of another machine, I had to use UTF-16LE encoding. – JASC May 21 '20 at 02:11

This has nothing to do with the encoding. The problem is the nulls in the file; to handle them, pass the skipNul = TRUE parameter.

For example:

neg = scan('F:/Natural_Language_Processing/negative-words.txt', what = 'character', comment.char = '', encoding = "UTF-8", skipNul = TRUE)

user1981275

The file might not have CRLF line endings; it might only have LF. Try checking the hex output of the file.

If so, try running the file through awk:

awk '{printf "%s\r\n", $0}' file > new_log_file
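
If you prefer to inspect the line endings from within R instead of a hex editor, a minimal sketch (the question's "mycsv.csv" filename is assumed):

# Dump the first bytes as hex: 0d 0a is CRLF, a bare 0a is LF
first_bytes <- readBin("mycsv.csv", what = "raw", n = 200)
print(first_bytes)
any(first_bytes == as.raw(0x0d))   # TRUE if carriage returns are present
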
steinmb

I had the same error message and figured out that although my files had a .csv extension and opened with no problems in a spreadsheet, they were actually saved as "All Formats" rather than "Text CSV (.csv)".

SCallan

Another quick solution:

Double check that you are, in fact, reading a .csv file!

I was accidentally reading a .rds file instead of .csv and got this "embedded null" error.
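
.rds files are R's serialized-object format and cannot be parsed by read.csv. If the file really is an .rds, it should be read with readRDS() instead; a minimal sketch with a hypothetical path:

# readRDS() restores the single R object stored in an .rds file
mydata <- readRDS("mydata.rds")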

Paul

In these cases, make sure the data you are importing does not contain "#" characters; if it does, try using the option comment.char = "". It worked for me.
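
For example, a minimal sketch assuming the question's "mycsv.csv" filename:

# comment.char = "" disables comment handling, so "#" in the data is read literally
mydata <- read.csv("mycsv.csv", comment.char = "")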

Phill