6

I have a comma-separated dataset of around 10,000 rows. When I read it with read.csv, R created a data frame with fewer rows than the original file: it excluded 200 rows. When I open the CSV file in Excel, it looks fine. The file is well formatted with respect to both line delimiters and field delimiters (judging by how Excel parses it).

I have identified the row numbers in my file which are getting rejected but I can't identify the cause by glancing over them.

Is there a log or something similar that shows the reason why R rejected these records?

user3422637
  • Are the 200 rejects randomly seeded throughout, or are they co-located in a particular area of your data? Have you seen `http://stackoverflow.com/questions/13706188/importing-csv-file-into-r-numeric-values-read-as-characters` – zipzit Sep 29 '14 at 07:31
  • They are in one area of the data. They are continuous records – user3422637 Sep 29 '14 at 07:40
  • Unlike the case presented in the link you shared, I have not done any data manipulations on Excel. I opened raw data on R. I had only opened the data on Excel to view it parsed.. didn't alter the file. – user3422637 Sep 29 '14 at 07:43
  • Please could you open the csv file in a text editor and copy a few of the lines to your post. Also please include the exact code you used to read the data into R – Rich Scriven Sep 29 '14 at 07:49
  • Some guesses: something goes wrong with quotes (you have an unescaped quote somewhere; probably in the line just before the lines that are skipped); the lines start with a `#`. – Jan van der Laan Sep 29 '14 at 07:54
  • Jan van der Laan was correct. The records were rejected because of double quotes in the csv file. Jan, please go ahead and add that as the answer. You deserve the credit. However, I am not sure how to remove the quotes using R before applying read.csv(). Can anyone help me with that? For now I removed the double quotes in Notepad++, but I would like to do that step in R. – user3422637 Sep 30 '14 at 03:13
  • I am not sure how I could share the data. Even a single record is too long a text and messes up the textbox on stackoverflow. I don't see any way of attaching a file. Anyway, the issue has been identified and is mentioned in the previous comment – user3422637 Sep 30 '14 at 03:17

4 Answers

18

The OP indicates that the problem is caused by quotes in the CSV-file.
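As far as I know, read.csv does not write a log of the rows it merges or drops, but `count.fields()` from the base `utils` package can help locate where the quoting goes wrong. The following is a minimal sketch on a made-up file (the sample rows and the `tmp` path are assumptions for illustration, not the OP's data):

```r
# Made-up sample: the rows with id 2 and id 4 each contain a stray double quote.
sample_lines <- c(
  "id,comment",
  "1,fine",
  "2,size 5\" screen",
  "3,also fine",
  "4,another 7\" tablet",
  "5,last row"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Count the fields on each physical line with the same quoting as read.csv.
# A line on which a quoted string is opened but not closed is reported as NA.
n_fields <- count.fields(tmp, sep = ",", quote = "\"")

# Flag lines that are NA or whose field count differs from the header's.
suspect_lines <- which(is.na(n_fields) | n_fields != n_fields[1])
print(suspect_lines)
```

The flagged line numbers point at the place in the file where a quote was opened, which is usually the line just before the rows that go missing.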

If the records in the CSV file are generally not quoted and only a few of them contain quotes, the file can be read using the quote="" option of read.csv, which disables quote handling.

data <- read.csv(filename, quote="")
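A minimal sketch of the effect, using a made-up six-line file (the rows and the `tmp` path are illustrative assumptions). With the default quoting, the two stray quotes should pair up across lines so that the rows between them end up inside a single field, while `quote = ""` keeps every line as its own row:

```r
# Made-up sample: five data rows, two of which contain a stray double quote.
sample_lines <- c(
  "id,comment",
  "1,fine",
  "2,size 5\" screen",
  "3,also fine",
  "4,another 7\" tablet",
  "5,last row"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Default quoting: the text between the two stray quotes is read as one string.
default_read <- read.csv(tmp)

# Quote handling disabled: every physical line becomes one row.
no_quote_read <- read.csv(tmp, quote = "")

print(nrow(default_read))
print(nrow(no_quote_read))
```

Note that with `quote = ""` the quote characters stay in the data as literal characters, and genuinely quoted fields that contain commas would be split.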

Another solution is to remove all quotes from the file, but this also modifies the data (your strings no longer contain any quotes) and causes problems if your fields contain commas.

lines <- readLines(filename)
lines <- gsub('"', '', lines, fixed=TRUE)
data <- read.csv(textConnection(lines))
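To illustrate the comma problem, here is a small sketch on made-up lines (the sample content is an assumption). Once the quotes are stripped, the comma inside the formerly quoted field becomes a field separator:

```r
# Made-up sample: the second line has a quoted field that contains a comma.
lines <- c(
  "id,comment,status",
  "1,\"a, b\",ok",
  "2,plain,ok"
)

# Remove every double quote, as in the snippet above.
stripped <- gsub('"', '', lines, fixed = TRUE)

# Count comma-separated pieces per line after stripping.
fields_per_line <- lengths(strsplit(stripped, ",", fixed = TRUE))
print(stripped)
print(fields_per_line)
```

The second line now splits into four pieces instead of three, so the row no longer lines up with the header.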

A slightly safer solution, which only changes quotes that are not directly before or after a comma (it doubles them):

lines <- readLines(filename)
lines <- gsub('([^,])"([^,])', '\\1""\\2', lines)
data <- read.csv(textConnection(lines))
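
A short sketch of what this pattern does, on made-up lines (the sample content is an assumption; `read.csv(text = ...)` is used here instead of `textConnection` only for brevity). Quotes that delimit a field sit next to a comma and are left alone, while the embedded quote is doubled:

```r
# Made-up sample: one properly quoted field and one stray embedded quote.
lines <- c(
  "id,comment,status",
  "1,\"quoted, field\",ok",
  "2,size 5\" screen,ok"
)

# Double only those quotes that have a non-comma character on both sides.
fixed_lines <- gsub('([^,])"([^,])', '\\1""\\2', lines)
print(fixed_lines)

# Quotes are now balanced on every line, so no row is absorbed into another.
data <- read.csv(text = fixed_lines)
print(nrow(data))
```

Because the pattern needs a character on both sides of the quote, a stray quote at the very start or end of a line is not touched.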
Jan van der Laan
5

I had the same issue: there was a significant difference between the number of rows in the CSV file and the number of rows read by read.csv(). I used fread() from the data.table package in place of read.csv() and it solved the problem.
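
A minimal sketch of that approach, assuming the data.table package is installed (the sample rows and the `tmp` path are made up for illustration). My understanding is that fread only treats a quote as a field delimiter when it starts a field, so stray embedded quotes should not merge rows:

```r
# Made-up sample: five data rows, two of which contain a stray double quote.
sample_lines <- c(
  "id,comment",
  "1,fine",
  "2,size 5\" screen",
  "3,also fine",
  "4,another 7\" tablet",
  "5,last row"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

if (requireNamespace("data.table", quietly = TRUE)) {
  # Read with fread; separator and header are set explicitly to avoid guessing.
  dt <- data.table::fread(file = tmp, sep = ",", header = TRUE)
  print(nrow(dt))
} else {
  message("data.table is not installed; skipping the fread example")
}
```

Comparing `nrow(dt)` with `length(readLines(tmp)) - 1` is a quick check that no rows were lost.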

Jefferson
Adnan Khalid
1

The records were rejected because of double quotes in the csv file. I removed the double quotes in Notepad++ before reading the file in R. If you can suggest a better way to remove the double quotes in R (before reading the file), please leave a comment below.

Pointed out by Jan van der Laan. He deserves the credit.

user3422637
0

In your last question you want to remove double quotes (that is, "") before reading the csv file in R. This is probably best done as a file preprocessing step using a one-line shell "sed" command (covered on the Unix & Linux forum).

sed -i 's/""/"/g' test.csv
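
If the step has to stay inside R, a rough equivalent of the sed command can be sketched with readLines, gsub and writeLines. The file below is a made-up temporary file standing in for test.csv (an assumption for illustration), and like the sed command it collapses each doubled quote to a single quote and writes the result back in place:

```r
# Made-up stand-in for test.csv with doubled double quotes in one field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,comment", "1,say \"\"hi\"\" there"), tmp)

# Read the raw lines, replace each "" with ", and overwrite the file.
lines <- readLines(tmp)
lines <- gsub('""', '"', lines, fixed = TRUE)
writeLines(lines, tmp)

print(readLines(tmp))
```

To drop the quotes entirely rather than collapse them, the pattern would be '"' with an empty replacement, as in Jan van der Laan's answer.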
Ruthger Righart