
I was doing some research on which import method is better, read.csv or read_csv. There were several threads comparing the import times etc., and most point to using read_csv for larger files (also fread).

While importing data, I came across an unusual situation.

I used read.csv and read_csv to import the same CSV file:

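library(readr)  # provides read_csv

# Import the same file with base R and with readr
CSV1 <- read.csv("C:\\Users\\AH0168850\\Desktop\\Claims.csv")
CSV2 <- read_csv("C:\\Users\\AH0168850\\Desktop\\Claims.csv")

class(CSV1$claim_amount)  # factor
class(CSV2$claim_amount)  # character

# Direct conversion: no warning for CSV1, all NA with a warning for CSV2
CSV1$claim_amount <- as.numeric(CSV1$claim_amount)
CSV2$claim_amount <- as.numeric(CSV2$claim_amount)
# Alternative to the line above (run on the freshly imported column):
# strip the dollar sign first, then convert
CSV2$claim_amount <- as.numeric(sub('\\$','',CSV2$claim_amount))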

The claim_amount column contains values with a leading $. When I use read.csv, claim_amount is imported as a factor, while read_csv imports it as character.

When I use as.numeric to convert the column to numeric, the data imported using read.csv goes through without any issue. However, the data imported using read_csv has all of its values converted to NA, with the warning "NAs introduced by coercion".
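A minimal sketch of the two coercions on an in-memory vector, without any file import. The `amounts` values are illustrative, and `as_factor`, `from_factor` and `from_character` are names made up for this example. It suggests that `as.numeric()` on a factor returns the internal integer level codes rather than the dollar amounts, which would explain why no warning appears, while `as.numeric()` on the character vector cannot parse the strings and yields NA.

```r
# Illustrative values, not the real Claims.csv contents
amounts <- c("$2,980", "$2,980", "$3,369.50", "$1,680", "$2,680")

# read.csv (with strings converted to factors) yields a factor like this one
as_factor <- factor(amounts)
levels(as_factor)  # the distinct labels, sorted as strings

# Coercing the factor returns the position of each value among the levels,
# not the amount itself, so there are no NAs and no warning
from_factor <- as.numeric(as_factor)

# Coercing the character vector tries to parse "$2,980" as a number and fails;
# the "NAs introduced by coercion" warning is suppressed here to keep output clean
from_character <- suppressWarnings(as.numeric(amounts))

print(from_factor)
print(from_character)
```

If this is what happens with the imported data, the numeric column obtained from read.csv would hold level codes, so it is worth comparing a few converted values against the original file.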

To successfully convert the read_csv data, I had to use a substitution before calling as.numeric. There are several threads that highlight the use of similar functions.

e.g.: http://r.789695.n4.nabble.com/Converting-dollar-value-factors-to-numeric-td2130536.html

https://www.rforexcelusers.com/remove-currency-dollar-sign-r/
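The substitution approach can be sketched as a small helper. `parse_dollars` is a hypothetical name introduced here, not a function from base R or readr, and the sketch assumes the only non-numeric characters are a dollar sign and thousands separators. It removes commas as well as the dollar sign, because a value such as "$2,980" still fails to parse once only the `$` is stripped.

```r
# Hypothetical helper: convert dollar-formatted strings (or factors) to numeric
parse_dollars <- function(x) {
  # as.character() makes a factor contribute its labels rather than its codes
  cleaned <- gsub("[$,]", "", as.character(x))
  as.numeric(cleaned)
}

# Works the same way on character and factor input
chr_result <- parse_dollars(c("$2,980", "$3,369.50", "$1,680"))
fct_result <- parse_dollars(factor(c("$2,980", "$3,369.50", "$1,680")))

print(chr_result)
print(fct_result)
```

Converting through `as.character()` first means the helper gives the same answer whichever import function produced the column.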

However, I couldn't find any that give an explanation of why this happens. I did read that read.csv forces a factor for character variables, but I am not sure why that would make a difference in using as.numeric.

Sundararaj Govindasamy
  • Can you make a reproducible example? If I try with `x <- as.factor(c("$100", "$999", "$111"))` and then convert it with `as.numeric(x)`, I don't get the correct result. I'm following your example; are you sure that `as.numeric()` on the factor gives the correct result? – RLave Dec 19 '18 at 16:13
  • Also this might be related: https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information – RLave Dec 19 '18 at 16:14
  • Yes, `x <- as.factor(c("$100", "$999", "$111"))` doesn't give correct results. However, when I load data using read.csv, and then use as.numeric, it gives the correct results. – Sandeep Warrier Dec 20 '18 at 15:27
  • I use the following example: Col1 = claim_amount, Col2 = total_policy_claims, Col3 = fraudulent. claim_amount values = $2,980, $2,980, $3,369.50, $1,680 and $2,680; total_policy_claims values = 1, 3, 1, 1 and 1; fraudulent = No, No, Yes, No and No. I import this csv using the read.csv function. `class(test1$claim_amount)` gives factor as the result; after using `test1$claim_amount <- as.numeric(test1$claim_amount)`, class gives numeric. Using read_csv gives `class(test2$claim_amount)` as character. Now, as.numeric will not work. It gives a warning: NAs introduced by coercion. – Sandeep Warrier Dec 20 '18 at 15:39
