6

Preliminary information: OS: Windows XP Professional Version 2002 Service Pack 3; R version: R 2.12.2 (2011-02-25)

I am attempting to read a 30,000 row by 80 column, tab-delimited text file into R using the read.delim() function. This file does have column headers with the following naming convention: "_". The code that I use to attempt to read the data in is:

cc <- c("integer", "character", "integer", rep("character", 3), 
        rep("integer", 73))

example_data <- read.delim(file = 'C:/example.txt', row.names = FALSE,
                           col.names = TRUE, as.is = TRUE, colClasses = cc)

After I submit this command, I receive the following error message:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
more columns than column names
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  header and 'col.names' are of different lengths

Information that may be important - from column 8 until column 80 the count of zeros in each column is as follows:

column 08: 29,000 zeros
column 13: 15,000 zeros
column 19: 500 zeros
column 43: 15,000 zeros
columns 65-80: 29,000 zeros for each column

Can anyone help identify reasons that I am receiving the above error messages? Any help will be greatly appreciated.

Gavin Simpson
Jubbles
  • What does this return: `count.fields(file = 'C:/example.txt', sep="\t")[1:10]` ? – IRTFM Sep 02 '11 at 13:47
  • @James: You are correct - cc is of length 79, which is the actual number of columns in my file. I rounded the dimensions in my post. – Jubbles Sep 02 '11 at 13:51
  • @DWin: I've been using R for a few years, and I learn something new everyday. Thanks for introducing the `count.fields()` function to me. – Jubbles Sep 02 '11 at 13:59
  • @Jubbles : you're welcome. I consider `count.fields` an essential part of the data input toolkit. It's also useful for identifying which lines have those "weird bits" like unmatched quotes or unexpected comment characters. – IRTFM Sep 02 '11 at 14:44
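
As a minimal sketch of the `count.fields()` check suggested in the comments: the temporary file and its contents below are invented for illustration and stand in for C:/example.txt. The function returns the number of fields on each line, so a line that disagrees with the header stands out.

```r
# Write a small tab-delimited file in which one line is a field short
# (hypothetical data, not the asker's file).
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc",
             "1\tx\t3",
             "2\ty",
             "3\tz\t5"), tmp)

# Number of tab-separated fields on every line, header included.
n_fields <- count.fields(tmp, sep = "\t")

# Tabulate the field counts, then list the lines that differ from the header.
table(n_fields)
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

The line numbers in `bad_lines` count the header as line 1, so they map directly onto the lines of the text file.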

3 Answers

7

The cause of the problem is your use of the col.names=TRUE argument. This argument is meant for specifying the column names of the resulting data frame manually, and therefore must be a vector with one name per column in the input. TRUE is a vector of length one, which is why R reports that the header and 'col.names' are of different lengths.

If you want read.delim to take column names from the file, consider using header=TRUE; you may also wish to reconsider row.names=FALSE, as again this argument is intended as a specification of the row names rather than an instruction about whether to read them from the file.

More information is available on the help page for read.delim.
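
A minimal sketch of the corrected call, assuming the file has a header row. The small temporary file and its column names are invented for illustration and stand in for C:/example.txt; with the real file, `cc` would be the 79-element vector from the question.

```r
# Write a tiny tab-delimited file with a header row (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id_num\tname_txt\tcount_val",
             "1\ta\t0",
             "2\tb\t5"), tmp)

# One class per column, in file order.
cc <- c("integer", "character", "integer")

# header = TRUE takes the column names from the first line of the file;
# row.names and col.names are left at their defaults.
example_data <- read.delim(file = tmp, header = TRUE, as.is = TRUE,
                           colClasses = cc)

str(example_data)
```

header = TRUE is already the default for read.delim, so it is spelled out here only to make the intent explicit.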

MatthewS
  • You are correct. I now feel a bit embarrassed by my question. I'm so used to writing data with `write.table(..., row.names = FALSE, col.names = TRUE, ...)` that I forgot that when **reading in** data, `col.names` specifies a vector of character types. – Jubbles Sep 02 '11 at 13:54
  • No need to feel embarrassed, that inconsistency between `read.table` and `write.table` is rather a trap! – MatthewS Sep 02 '11 at 13:57
5

I also recently had the same error, and it disappeared after converting the file to comma- or semicolon-delimited format and reading it with read.csv / read.csv2. I know this is not a fulfilling answer, but you might check that out.

langohrschnauze
-1

If you want to read the file as a character matrix, first convert it to .csv format and use read.csv. Don't pass any argument other than the file name, e.g.:

read.csv("filepath")
Mithilesh Kumar