Reading file containing numerical values and unknown errors (random strings) in R

Question

Some of the .csv files with numerical data that I work with contain errors. Each error is marked by a random string, so after reading a file in, the data frame could look like this:

set.seed(123)
rand.str <-  paste0(letters[sample(10)], collapse="")
wrong.output <- data.frame(a=1:5, b=c(4:5, rand.str, 7:8), stringsAsFactors=FALSE)

In this case the proper output is:

proper.output <- data.frame(a=1:5, b=c(4:5, NA, 7:8))

After reading with read.csv, each column that contains at least one character value is treated as a character column.

Can I mark the errors (random strings) as NAs while reading in the file? If not, what is the most convenient, proper or fastest method for replacing them with NAs?

There is the na.strings argument in read.csv, but it only helps in simpler cases where the error markers are known in advance, for example: na.strings=c("-", "unavailable")
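To illustrate that limitation, here is a minimal sketch. The inline data and the variable names (csv_text, known) are made up for this example and are not from the files described above; the point is only that na.strings handles placeholders listed in advance and leaves an unpredictable string in place.

```r
# Hypothetical inline data: "-" and "unavailable" are known placeholders,
# "chdgfajibe" stands in for an unpredictable error string.
csv_text <- "a,b\n1,4\n2,-\n3,unavailable\n4,chdgfajibe\n5,8"

known <- read.csv(text = csv_text,
                  na.strings = c("-", "unavailable"),
                  stringsAsFactors = FALSE)

# The listed placeholders are read as NA ...
is.na(known$b)

# ... but the unlisted string keeps column b as character.
class(known$b)
```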

(I can't find any duplicate question, so I guess there is a simple workaround.)

The colClasses suggestion does not work:

read.csv("test.txt", sep=",", colClasses = c("numeric", "numeric"))

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  scan() expected 'a real', got 'chdgfajibe'
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'test.txt'

have you tried setting `colClasses=c("numeric")` within `read.csv` ? — Aramis7d, Feb 01 '17 at 13:16

score 1 · Answer 1 · edited May 23 '17 at 10:29

I adapted this from an answer to a different CSV-reading question that was posted about 7 years earlier. I think it is a cleaner approach, and it gives your desired output.

setClass("Alpha")
# strip alphabetic characters; an all-letter string becomes "" and then NA
setAs("character", "Alpha", 
      function(from) as.numeric(gsub('[[:alpha:]]+', '', from) ) )
read.csv('data.csv', colClasses = c('numeric','Alpha'))

Source: How to read data when some numbers contain commas as thousand separator?
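One caveat with stripping letters is that a mixed value such as "12abc" would be read as 12, and "1e5" would become 15. If any value that is not a clean number should become NA, a variant of the same setAs idea could look like the sketch below. The class name NumOrNA and the inline data are hypothetical and only used for illustration.

```r
library(methods)

# "NumOrNA" is a hypothetical class name chosen for this sketch.
setClass("NumOrNA")

# Convert the whole string; anything as.numeric() cannot parse becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
setAs("character", "NumOrNA",
      function(from) suppressWarnings(as.numeric(from)))

# Inline stand-in for the file from the question.
csv_text <- "a,b\n1,4\n2,5\n3,chdgfajibe\n4,7\n5,8"

res <- read.csv(text = csv_text, colClasses = c("numeric", "NumOrNA"))
str(res)
```

The trade-off is that this variant never rescues a partially numeric value; whether that is desirable depends on how the error strings look in the real files.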

score 0 · Answer 2 · answered Feb 01 '17 at 13:10

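The solution is to convert every column with as.numeric after reading; strings that cannot be parsed become NA (R emits a coercion warning):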

wrong.output[] <- lapply(wrong.output, as.numeric)

Qbik
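For completeness, the same post-read conversion can be sketched end to end, assuming every column is first read as character. The inline data and the names raw, clean and bad are hypothetical; the extra step at the end records which cells were replaced, which may help when checking the source files.

```r
# Inline stand-in for the file from the question.
csv_text <- "a,b\n1,4\n2,5\n3,chdgfajibe\n4,7\n5,8"

# Read everything as character so that no value is rejected by scan().
raw <- read.csv(text = csv_text, colClasses = "character")

# Convert each column; unparseable strings become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
clean <- raw
clean[] <- lapply(raw, function(col) suppressWarnings(as.numeric(col)))

# Locate the cells that were present in the raw data but are NA after
# conversion, as (row, col) index pairs.
bad <- which(is.na(as.matrix(clean)) & !is.na(as.matrix(raw)),
             arr.ind = TRUE)

str(clean)
bad
```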
