
I have seen several cases where read.table() is unable to read a tab-delimited file (for example, the annotation table of a microarray), returning the following error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
line xxx did not have yyy elements

read.csv() works perfectly on the same file with no errors. I also think read.csv() is faster than read.table().

Even stranger: read.table() behaves very oddly on one of my files. It throws this error while reading line 100, but when I copy and paste lines 90 to 110 right after the header of the same file, it still reports the error at line 100+21 (shifted by the 21 lines pasted at the beginning). If there really were a problem with that line, why doesn't it report the error when reading the pasted copy near the top? I confirm that read.csv() reads the same file with no error.

Do you have any idea why read.table() is unable to read files that read.csv() handles fine? And is there any reason to prefer read.table() in some cases?

Ali

  • Also read the help page for `read.table()` under memory usage as to why it may appear slow for large files. – Chase Oct 10 '12 at 21:23
  • We can't answer your (updated) question without a reproducible example. The most common reading problems are (1) undetected comment characters, (2) unmatched quotation marks, (3) changes in number of fields per line after the first 5 lines of the file when `fill=TRUE`. Because `read.csv` and `read.table` have different default values for `comment`, `quote`, and `fill`, any of these could be the problem. – Ben Bolker Oct 10 '12 at 21:34
  • PS there are 8 combinations of `comment`/`quote`/`fill`: you could experiment with all of them and see how the results differ -- that might lead you to the answer. `count.fields()` is handy for diagnostics too. – Ben Bolker Oct 10 '12 at 21:43
  • The collective experience of every R expert I've ever encountered has led me to develop a prior for problems like this that basically consists of the [Dirac Delta](http://en.wikipedia.org/wiki/Dirac_delta_function) function with infinite mass on "there's a weird line/character in your file", not a problem with `read.table` or `read.csv`. – joran Oct 10 '12 at 21:44
  • I am not sure how to share the file on SO – Ali Oct 10 '12 at 21:45
  • @joran: yes, but the interaction of `fill` and differing number of fields is so weird that it is a severe "misfeature", maybe even a ((bug)) ... Ali, you generally just have to post your file/share it via one of the many public file-sharing mechanisms ... – Ben Bolker Oct 10 '12 at 21:56
  • PS @joran, it occurs to me that you probably don't mean "infinite mass", you probably mean "unit mass, infinitely concentrated" (infinite density) ... – Ben Bolker Oct 06 '15 at 19:20
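Ben Bolker's `count.fields()` suggestion can be sketched like this (the file name `annotation.txt` is a placeholder for your own file; the quote/comment settings shown are the actual defaults of read.table() and read.csv()):

```r
# count.fields() reports how many fields each line appears to have,
# which pinpoints the offending lines without parsing the whole file.
# "annotation.txt" is a placeholder for your own file.

# Under read.table()'s defaults (quote = "\"'", comment.char = "#"):
n_tab <- count.fields("annotation.txt", sep = "\t")

# Under read.csv()-style defaults (quote = "\"", no comment character):
n_csv <- count.fields("annotation.txt", sep = "\t",
                      quote = "\"", comment.char = "")

table(n_tab, useNA = "ifany")  # NA means an unmatched quote spanned lines
which(n_tab != median(n_tab, na.rm = TRUE))  # line numbers of the suspects
```

If the counts differ between the two calls, a quote or comment character in the data is almost certainly the culprit.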

3 Answers


read.csv is a fairly thin wrapper around read.table; I would be quite surprised if you couldn't exactly replicate the behaviour of read.csv by supplying the correct arguments to read.table. However, some of those arguments (such as the way that quotation marks or comment characters are handled) could well change the speed and behaviour of the function.

In particular, this is the full definition of read.csv:

function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...) {
    read.table(file = file, header = header, sep = sep, quote = quote,
        dec = dec, fill = fill, comment.char = comment.char, ...)
}

so as stated it's just read.table with a particular set of options.

As @Chase states in the comments below, the help page for read.table() says just as much under Details:

read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.
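So, as a quick sanity check, passing read.csv's defaults to read.table explicitly should give an identical result (this sketch writes a small temporary file so it is self-contained):

```r
# Write a small CSV to a temporary file so the example is self-contained.
tf <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), tf)

a <- read.csv(tf)
b <- read.table(tf, header = TRUE, sep = ",", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "")
identical(a, b)  # TRUE: read.csv() is just read.table() with these defaults
```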

Ben Bolker
  • Good answer - I would just add that the help page for `read.table()` says just as much under details `read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.`. So to the OP - yes, you would want `read.table` when your data don't match the default values for `read.csv` – Chase Oct 10 '12 at 21:22

Don't use read.table to read tab-delimited files; use read.delim. (It is just a thin wrapper around read.table, but it sets the options to appropriate values.)
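A minimal sketch of that equivalence (self-contained via a temporary file):

```r
# read.delim() is read.table() preconfigured for tab-delimited files.
tf <- tempfile(fileext = ".txt")
writeLines(c("id\tsymbol", "1\tTP53", "2\tBRCA1"), tf)

a <- read.delim(tf)
b <- read.table(tf, header = TRUE, sep = "\t", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "")
identical(a, b)  # TRUE
```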

hadley

read.table() does sometimes fail on tab-separated files, and switching to generic whitespace splitting may help, assuming no item in your table contains a space. (Note that base R's read.table() only accepts a single-byte sep, so a regular expression such as '\s+' won't work there; sep = "", the default, already splits on any run of whitespace.)
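A small self-contained sketch of whitespace splitting with sep = "":

```r
# sep = "" (read.table()'s default) splits on any run of whitespace --
# spaces or tabs -- but only works when no field itself contains a space.
tf <- tempfile()
writeLines(c("id symbol", "1   TP53", "2\tBRCA1"), tf)

d <- read.table(tf, header = TRUE)  # sep = "" is the default
print(d)  # two columns, despite the mix of spaces and tabs
```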

Bhargav Rao