0

I have a big CSV file (more than 100,000 lines). Except the first column (which is the time), this file is supposed to contain only numeric values.

However, when I read the file using read.csv function, some of the data frame variables only contain characters.

I tried to convert those variables in numeric with the as.numeric function and then I get an warning message:

Warning message:

NAs introduced by coercion

My guess is that there is an error in how the CSV file is written. I want to fix that error, but to do that, I need to know where are the problems.

Is there a way to have more detailed information why (and especially where) as.numeric fails to convert my data to numeric?

Ben
  • 6,321
  • 9
  • 40
  • 76

1 Answers1

1

It would help if you provided a reproducible example. Since you didn't, I'll try to make one. Hopefully, this will be worthwhile.

write.csv(read.table(text="A 1  2   3
B   2   3   4
C   3   4   5
D   4   5   6
E   5   q   7
F   6   7   8
G   7   8   9", header=F, stringsAsFactors=F), file="file.csv", row.names=F)

This file should open with numeric variables (other than the first column), but the third column is screwed up somewhere:

my.data <- read.csv("file.csv", stringsAsFactors=F, header=T)
str(my.data)
# 'data.frame':  7 obs. of  4 variables:
#  $ V1: chr  "A" "B" "C" "D" ...
#  $ V2: int  1 2 3 4 5 6 7
#  $ V3: chr  "2" "3" "4" "5" ...
#  $ V4: int  3 4 5 6 7 8 9

One thing you can try is to read in the file one row at a time, and see when the class of the variable in question switches from numeric:

for(i in 1:7){
  row <- read.csv("file.csv", stringsAsFactors=F, header=T, skip=i-1, nrows=1)
  if(is.character(row[1,3])){ break() }
}
i         # [1] 5
row[1,3]  # [1] "q"

Having discovered the culprit (or at least the first one), you could open the file in Excel, or something like EditPad, and navigate to the bad element and fix it. (Be sure to do this to a copy of the file; keep the original—warts and all—for a permanent record.)


Edit:
If you already have a vector with NAs, you can find out where they are with:

which(is.na(as.numeric(my.data[,3])))
# [1] 5
# Warning message:
# In which(is.na(as.numeric(my.data[, 3]))) : NAs introduced by coercion

Or jump straight to inspecting the elements with:

my.data[which(is.na(as.numeric(my.data[,3]))),3]
# [1] "q"
# Warning message:
# In which(is.na(as.numeric(my.data[, 3]))) : NAs introduced by coercion
Community
  • 1
  • 1
gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79