4

I have data.frames with character columns containing numbers (like '0123', '1234' etc). When I write them to csv and read them back, they end up as numeric columns. The write.csv and read.csv functions have quote arguments, and by default should quote character strings on output and respect them on input, so this behavior is unexpected.

How can I avoid this, without manually specifying colClasses when I read the file back in?

Reproducible example:

# dummy data
fake_data <- 
  data.frame(num=1:25, char=letters[1:25], charnum=as.character(1:25),
             stringsAsFactors=F)

# check out col classes - all good
sapply(fake_data, class)

#       num        char     charnum 
# "integer" "character" "character" 

# write it to a file and read it back
fpath <- '~/Desktop/fake_data.csv'
write.csv(fake_data, fpath, row.names=F)
fake_data2 <- read.csv(fpath, stringsAsFactors=F)

# but now look, different classes!
sapply(fake_data2, class)

#       num        char     charnum 
# "integer" "character"   "integer"

It seems like the error is on the read side, since the file is being written with quotes.

> cat(readLines(fpath))
"num","char","charnum" 1,"a","1" 2,"b","2" 3,"c","3" 4,"d","4" 5,"e","5" 6,"f","6" 7,"g","7" 8,"h","8" 9,"i","9" 10,"j","10" 11,"k","11" 12,"l","12" 13,"m","13" 14,"n","14" 15,"o","15" 16,"p","16" 17,"q","17" 18,"r","18" 19,"s","19" 20,"t","20" 21,"u","21" 22,"v","22" 23,"w","23" 24,"x","24" 25,"y","25"

sessionInfo:

R version 3.1.1 (2014-07-10) | Platform: x86_64-apple-darwin13.1.0 (64-bit)

arvi1000
  • 9,393
  • 2
  • 42
  • 52
  • This question (http://stackoverflow.com/questions/22923756/r-respect-quotes-around-numbers-treat-as-character-with-read-csv) is closely related, but got little attention and doesn't have a solution posted that applies to my question here as stated; I can't specify all colClasses as character since I have some numeric fields – arvi1000 Dec 11 '14 at 21:20
  • @user20650, really? read.csv calls read.table. And when I run `read.table(fpath, header=T, sep=',', stringsAsFactors=F)` I get the same issue – arvi1000 Dec 11 '14 at 21:24
  • @arvi1000; yup youre right – user20650 Dec 11 '14 at 23:02

5 Answers5

4

Thanks for the answers. Looking at this further, I have the following to add.

Option 1: just use data.table::fread -- works as I would like

Option 2: Do this to construct a colClasses strings

 # read header and first data line
 first_data_line <- strsplit(readLines(fpath, n=2L)[2], ',')[[1]]

 # find which fields have double quotes
 char_fields <- grep('"', first_data_line)

 # construct colClasses vec
 cc <- rep(NA, length(first_data_line))
 cc[char_fields] <- 'character'

Since I'm a fan of data.table anyway, #1 is probably what I'll do.

arvi1000
  • 9,393
  • 2
  • 42
  • 52
1

Adding quote="" seems to achieve what you were looking for:

sapply( read.csv(fpath, stringsAsFactors=F), class)
sapply( read.csv(fpath, quote="", stringsAsFactors=F), class)
renato vitolo
  • 1,744
  • 11
  • 16
  • That is something, thank you. But it has the side effect of not reading in the data correctly, because the quotes become part of the string and the column names are messed up. One could repair the results with gsub, but still a hacky workaround to what seems like a basic feature of read.csv! – arvi1000 Dec 11 '14 at 21:49
  • Just found out that if `colClasses` is set to `NA` then `read.table` will always try to convert columns into logical, integer, etc. by calling `type.convert()`; therefore, I see no other way than explicitly specifying `colClasses`. – renato vitolo Dec 12 '14 at 08:10
  • Yes, that's my conclusion too. Thank you for answering! – arvi1000 Dec 12 '14 at 14:28
1

You can use the quote parameter to control how read.csv deals with quotation marks. If you set quote="" then you can deal with them in R:

gsub.remove.quotes <- function(x) 
{
      if (is.character(x))
            return(gsub('"$','',gsub('^"','',x)))
      else
            return(x)
}
fake_data2 <- read.csv(fpath, stringsAsFactors=F,quote="")
fake_data3 <- data.frame(lapply(fake_data2,gsub.remove.quotes),stringsAsFactors = F)
sapply(fake_data3, class)

Note that the quote-stripping function is needed as a second step, because the character variables in fake_data2 actually have the quotation marks in them.

jafelds
  • 894
  • 8
  • 12
1

The discussions here are great but could be confusing for someone who simply wants to keep numbers as characters and needs a quick answer (like me). Sorry I did not notice the precondition in the question that "without manually specifying colClasses". But neither did google since this page is on the top of google search, and I cannot believe it took me so much time to figure it out.

The correct answer is in the comments part of earlier answers: "There is no other way than explicitly specifying colClasses" in the option of read.csv. It is also the easiest way to get numbers as characters (or factors). It is not about write.csv, stringsAsFactors or quotes. And this question discussed colClasses in details: Specifying colClasses in the read.csv.

Another easy approach is to modify the original numbers a little bit, 0001 into _0001 for example. In my case, it can also save the trouble.

Yang Liu
  • 521
  • 6
  • 4
0

Using data.table for the write and read did not work for me.

Write.csv() or data.table::fwrite() writes a character string that looks like a number as a number. For example, when character string '001234' is written using write.csv() or data.table::fwrite() and the file then read using read.csv() or data.table::fread() in r (or as a csv in Excel), the former character string now reads as '1234' and has class numeric. Using R's native format, .Rds, preserves the variable class. The equivalent commands are saveRDS() and readRDS(). Hopefully this helps someone else.

Eric Aya
  • 69,473
  • 35
  • 181
  • 253
Philip0505
  • 41
  • 5