Some .csv files with numerical data that I work with contain errors. Each error is marked with a random string. For example, after reading in, the data frame could look like this:
set.seed(123)
rand.str <- paste0(letters[sample(10)], collapse="")
wrong.output <- data.frame(a=1:5, b=c(4:5, rand.str, 7:8), stringsAsFactors=FALSE)
In this case, the proper output is:
proper.output <- data.frame(a=1:5, b=c(4:5, NA, 7:8))
After reading with read.csv, each column with at least one character value is treated as a character column.
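To reproduce this without a file, here is a minimal sketch that passes the data through the text argument of read.csv. The inline CSV and the names csv.text and df are only illustrative; the string in the third row stands in for one of the random error markers.

```r
# Inline CSV mirroring wrong.output; row 3 of column b holds an error marker.
csv.text <- "a,b\n1,4\n2,5\n3,chdgfajibe\n4,7\n5,8"

# stringsAsFactors = FALSE keeps the behaviour the same across R versions.
df <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# Column a is read as integer, but column b falls back to character
# because one of its values cannot be parsed as a number.
sapply(df, class)
```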
Can I mark errors (random strings) as NAs while reading in the file? If not, what is the most convenient, proper, or fastest method for replacing them with NAs?
There is the na.strings argument in read.csv, but it is a solution only in simpler cases where the error markers are known in advance, for example: na.strings=c("-", "unavailable")
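To illustrate the limitation, a small sketch (the inline data and the names known, random, df.known and df.random are made up for the example): na.strings handles markers listed up front, but a random string that is not in the list still forces the column to character.

```r
# Case 1: the error markers are known, so na.strings can map them to NA.
known <- "a,b\n1,4\n2,-\n3,unavailable\n4,7"
df.known <- read.csv(text = known, na.strings = c("-", "unavailable"))
class(df.known$b)   # a numeric (integer) column
is.na(df.known$b)   # rows 2 and 3 are NA

# Case 2: the marker is a random string that is not listed in na.strings.
random <- "a,b\n1,4\n2,chdgfajibe\n3,7"
df.random <- read.csv(text = random, na.strings = c("-", "unavailable"),
                      stringsAsFactors = FALSE)
class(df.random$b)  # still character
```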
(I can't see any duplicate, so I guess there is a simple workaround.)
The colClasses suggestion does not work:
read.csv("test.txt", sep=",", colClasses = c("numeric", "numeric"))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  scan() expected 'a real', got 'chdgfajibe'
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'test.txt'