I have a file with 22268 rows and 2521 columns. I try to read it in with this line of code:

file <- read.table(textfile, skip=2, header=TRUE, sep="\t", fill=TRUE, blank.lines.skip=FALSE)

But only 13024 rows by 2521 columns are read in, and I get the following warning:

Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : number of items read is not a multiple of the number of columns

I also used this command to see what rows had an incorrect number of columns:

x <-count.fields(textfile, sep="\t", skip=2)
incorrect <- which(x != 2521)

and got back a list of about 20 rows that were incorrect.

Is there a way to fill these rows with NA values?

I thought that is what the "fill" parameter does in the read.table function, but it doesn't appear so.

OR

Is there a way to ignore these rows that are identified in the "incorrect" variable?

A. Suliman
Sheila
  • check for spurious quotation marks and comment characters ... – Ben Bolker Dec 03 '12 at 23:32
  • There is no need for `R` in the title as the `r` `tag` makes that obvious. – mnel Dec 04 '12 at 01:26
  • Is there a way to avoid reading in the rows that are listed in "incorrect"? – Sheila Dec 04 '12 at 03:05
  • @Sheila, If the goal is to have a data frame with only the correct lines, you can still read-in the bad lines to a temporary variable and then from the temporary variable read only the "correct" lines. – Ricardo Saporta Dec 04 '12 at 04:25
  • Hi All. I was able to solve this problem. After obtaining the indexes of the "incorrect" rows, I created a vector called "allRows" that ranged from 1 to 22268 (header included), and found the rows I wanted to read in by removing any row indexes that were "incorrect". Then I was able to use readTable (from the R.utils library) to read in only the rows that I actually wanted. Thanks for everyone's help. – Sheila Dec 06 '12 at 02:39

1 Answer

You can use readLines() to read the file in as raw text, then find the offending rows.

    con <- file("path/to/file.csv", "rb")
    rawContent <- readLines(con)  # character vector, one element per line of the file
    close(con)  # close the connection to the file, to keep things tidy

Then take a look at `rawContent`.
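
One way to take that look, following the suggestion in the comments to check for spurious quotation marks and comment characters, is to flag the lines that contain them. This is only a sketch: the toy `rawContent` below is made-up data standing in for the real file, and `hasQuote`, `hasComment`, and `nFields` are illustrative names.

```r
# Made-up stand-in for rawContent; 3 tab-separated columns are expected here
rawContent <- c("id\tname\tvalue",
                "1\tfoo\t3.2",
                "2\tO'Brien\t4.1",     # stray apostrophe
                "3\tbar # note\t5.0",  # comment character
                "4\tbaz")              # genuinely short line

hasQuote   <- grepl("['\"]", rawContent)          # lines containing ' or "
hasComment <- grepl("#", rawContent, fixed=TRUE)  # lines containing #

# fields per line = number of tabs + 1
nFields <- lengths(regmatches(rawContent, gregexpr("\t", rawContent, fixed=TRUE))) + 1L

overview <- data.frame(line=seq_along(rawContent), nFields, hasQuote, hasComment)
print(overview)
```

Lines flagged by `hasQuote` or `hasComment` but with the expected `nFields` are likely candidates for rows that are fine on disk but get misparsed under the default quote and comment settings.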

To find the rows with an incorrect number of columns, for example:

    expectedColumns <- 2521
    delim <- "\t"

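    indxToOffenders <-
    sapply(rawContent, function(x) {   # for each line in rawContent
        hits <- gregexpr(delim, x, fixed=TRUE)[[1]]   # positions of each delim, or -1 if there are none
        sum(hits > 0) != expectedColumns - 1   # a line with n columns contains n - 1 delims
    }, USE.NAMES=FALSE)   # logical vector: TRUE marks an offending line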
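
As a cross-check, base R's `count.fields()` (already used in the question) can produce the same kind of count. By default it treats `'` and `"` as quotes and `#` as a comment character, so its result can differ from a raw delimiter count; passing `quote=""` and `comment.char=""` turns both off. A small sketch on made-up data written to a temporary file (the names `toyLines`, `tmp`, `rawCounts`, and `shortLines` are illustrative):

```r
toyLines <- c("a\tb\tc",
              "1\tO'Brien\t3",     # contains an apostrophe
              "4\t5 # five\t6",    # contains a comment character
              "7\t8")              # genuinely short line
tmp <- tempfile(fileext=".txt")
writeLines(toyLines, tmp)

# count fields purely by tabs: no quote or comment handling
rawCounts  <- count.fields(tmp, sep="\t", quote="", comment.char="")
shortLines <- which(rawCounts != 3)   # indexes of lines without 3 fields

unlink(tmp)   # remove the temporary file
```

If the list of bad rows shrinks once quoting and comment handling are disabled, the remaining entries are the lines that are really short or long in the file itself.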

Then to read in your data:

    myDataFrame <- read.csv(text=rawContent[!indxToOffenders], header=TRUE, sep=delim)  # header=TRUE assumes the first kept line holds the column names; use FALSE otherwise
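
To tie this back to the two options raised in the question (ignore the bad rows, or fill them with NA), here is a hedged end-to-end sketch on made-up data. It assumes the malformed lines are simply too short, and it sets `quote=""` and `comment.char=""` so that apostrophes and `#` are read literally; whether that is appropriate depends on the real file. The names `isBad`, `dropped`, and `padded` are illustrative.

```r
# Made-up stand-in for rawContent; 3 columns are expected here
rawContent <- c("id\tname\tvalue",
                "1\tfoo\t3.2",
                "2\tO'Brien\t4.1",   # apostrophe, read literally because quote=""
                "3\tbar")            # short line: the value field is missing

expectedColumns <- 3
delim <- "\t"

# fields per line = number of delimiters + 1
nFields <- lengths(regmatches(rawContent, gregexpr(delim, rawContent, fixed=TRUE))) + 1L
isBad <- nFields != expectedColumns

# Option 1: drop the malformed lines before parsing
dropped <- read.table(text=rawContent[!isBad], header=TRUE, sep=delim,
                      quote="", comment.char="", stringsAsFactors=FALSE)

# Option 2: keep every line and let fill=TRUE pad missing trailing fields
padded <- read.table(text=rawContent, header=TRUE, sep=delim, fill=TRUE,
                     quote="", comment.char="", stringsAsFactors=FALSE)
```

Note that `fill=TRUE` only pads lines that have too few fields; lines with too many fields still need to be dropped or repaired by hand.
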
Ricardo Saporta
  • Hi @Ricardo - Is there a specific library you are suggesting to use? In your example above you typed "library" then "expectedColumns <- 2521" etc... – Sheila Dec 04 '12 at 01:08
  • Hi @Sheila, I was originally going to recommend using `stringr` in lieu of regex, but there was no need. – Ricardo Saporta Dec 04 '12 at 01:18
  • Hi @Ricardo, and just to be clear, there should also be an assignment to indxToOffenders correct? So we see " indxToOffenders <- sapply...."? – Sheila Dec 04 '12 at 02:13
  • Hi @Ricardo. In my code above, the variable "incorrect" has the indexes of the rows that do not have all 2521 columns. Is there a way to avoid reading in these rows? – Sheila Dec 04 '12 at 03:07
  • Yep, you can use the argument n in readLines to only read in a few at a time. Can I ask why not just read them, but then not add them to your data frame? – Ricardo Saporta Dec 04 '12 at 03:51
  • Note the last line in this answer had `rawContent[-indxToOffenders]`, meaning that the bad rows will be excluded. (You can substitute `Incorrect` for `indxToOffenders` if you'd like) – Ricardo Saporta Dec 04 '12 at 03:54