Today I have finally decided to start climbing R's steep learning curve. I have spent a few hours and I managed to import my dataset and do a few other basic things, but I am having trouble with the data type: a column which contains decimals is imported as integer, and conversion to double changes the values.

In trying to get a small csv file to put here as an example, I discovered that the problem only happens when the data file is large. My original file is a 1048418 by 12 matrix, but even with "only" 5000 rows I have the same problem. With only 100, 1000 or even 2000 rows, the column is imported correctly as double.

Here is a smaller dataset (still 500kb, but again, if the dataset is small the problem is not replicated). The code is

> ex <- read.csv("exampleshort.csv",header=TRUE)
> typeof(ex$RET)
[1] "integer"

Why is the column of returns being imported as integer when the file is large, when it is clearly of the type double?

The worst thing is that if I try to convert it to double, the values are changed:

> exdouble <- as.double(ex$RET)
> typeof(exdouble)
[1] "double"

> ex$RET[1:5]
[1] 0.005587  -0.005556 -0.005587 0.005618  -0.001862
2077 Levels: -0.000413 -0.000532 -0.001082 -0.001199 -0.0012 -0.001285 -0.001337 -0.001351 -0.001357 -0.001481 -0.001486 -0.001488 ... 0.309524

> exdouble[1:5]
[1] 1305  321  322 1307   41

This is not the only column that is imported wrong, but I figured that if I find a solution for one column, I should be able to sort the other ones out. Here is some more information:

> sapply(ex,class)
PERMNO      DATE    COMNAM     SICCD       PRC       RET      RETX    SHROUT    VWRETD    VWRETX    EWRETD    EWRETX 
"integer" "integer"  "factor" "integer"  "factor"  "factor"  "factor" "integer" "numeric" "numeric" "numeric" "numeric" 

They should be in this order: integer, date, string, integer, double, double, double, integer, double, double, double, double (the types are probably wrong, but hopefully you will get what I mean)

Vivi
  • @Xu Wang: the first half won't work. Cutting it down to the first 5 thousand observations, less than 1% of my data, already creates problems... – Vivi Dec 05 '11 at 06:51
  • sorry that I didn't finish my comment because I went and read the `read.csv` help. What I wanted to say was that I thought there were maybe some strange values that confused `R`. So I thought that it wasn't the fact of large or small but rather that the large dataset has one of those confusing characters or values. Does that make sense? If not, it doesn't matter. I think the solution is to use the colClasses argument. – Xu Wang Dec 05 '11 at 06:55
  • @Xu Wang I understand what you are saying, but I am still not quite sure of how to solve my problem. How do I use the colClasses argument? Would you be able to give me the one line command to import this file correctly using the colClasses argument? – Vivi Dec 05 '11 at 06:58
  • sure we can figure this out! Please see my comment in the answer. I need some other information from you. – Xu Wang Dec 05 '11 at 07:00

1 Answer

See the help for read.csv: ?read.csv. Here is the relevant section:

colClasses: character.  A vector of classes to be assumed for the
          columns.  Recycled as necessary, or if the character vector
          is named, unspecified values are taken to be ‘NA’.

          Possible values are ‘NA’ (the default, when ‘type.convert’ is
          used), ‘"NULL"’ (when the column is skipped), one of the
          atomic vector classes (logical, integer, numeric, complex,
          character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
          Otherwise there needs to be an ‘as’ method (from package
          ‘methods’) for conversion from ‘"character"’ to the specified
          formal class.

          Note that ‘colClasses’ is specified per column (not per
          variable) and so includes the column of row names (if any).
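
A minimal sketch of how colClasses can be used, run on a tiny CSV passed through the text argument. Only the column names are borrowed from the question; the values are invented for illustration. An unnamed vector gives one class per column in file order, while a named vector forces only the listed columns and leaves the rest to be guessed.

```r
# Invented sample data; only the column names come from the question.
csv_text <- "PERMNO,COMNAM,RET
10001,ACME CORP,0.005587
10001,ACME CORP,-0.005556"

# Unnamed vector: one class per column, in file order.
by_position <- read.csv(text = csv_text, header = TRUE,
                        colClasses = c("integer", "character", "numeric"))

# Named vector: only RET is forced; the other columns are guessed as usual.
by_name <- read.csv(text = csv_text, header = TRUE,
                    colClasses = c(RET = "numeric"))

sapply(by_position, class)
sapply(by_name, class)
```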

Good luck with your quest to learn R. It's difficult, but so much fun after you get past the first few stages (which I admit do take some time).

Try this, which sets PRC to numeric, and fix the other columns accordingly:

ex <- read.csv("exampleshort.csv", header = TRUE,
               colClasses = c("integer", "integer", "factor", "integer",
                              "numeric", "factor", "factor", "integer",
                              "numeric", "numeric", "numeric", "numeric"),
               na.strings = ".")

As Ben Bolker points out in the comments, the colClasses argument is probably not needed: specifying na.strings alone is enough. However, supplying colClasses can make the import faster, especially with a large dataset.

na.strings must be specified, because the file marks missing values with "." and R does not treat that as missing by default. See the following section in ?read.csv:

 na.strings: a character vector of strings which are to be interpreted
      as ‘NA’ values.  Blank fields are also considered to be
      missing values in logical, integer, numeric and complex
      fields.
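
To illustrate the effect, here is a small sketch with invented values in which one RET cell holds ".", the missing-value marker found in the question's file. It assumes nothing about the real data beyond that marker.

```r
# Invented sample data with one "." standing in for a missing return.
csv_text <- "PERMNO,RET
10001,0.005587
10001,.
10001,-0.001862"

# Default settings: the "." blocks numeric conversion, so RET becomes
# character (R >= 4.0.0) or factor (older R versions).
raw <- read.csv(text = csv_text, header = TRUE)
class(raw$RET)

# Declaring "." as missing lets the column be read as numeric with an NA.
clean <- read.csv(text = csv_text, header = TRUE, na.strings = ".")
class(clean$RET)
clean$RET
```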

For reference (this should not be used as the solution, because the best approach is to import the data correctly in one step): RET was not imported as an integer. It was imported as a factor. typeof reports "integer" only because a factor is stored internally as integer level codes, and as.double returns those codes rather than the original values. To convert a factor to numeric, go through the character labels first:

new_RET <- as.numeric(as.character(ex$RET))
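
The difference between the two conversions can be seen in a small sketch with made-up values. stringsAsFactors = TRUE is passed explicitly because R 4.0.0 and later no longer create factors by default.

```r
# Invented sample data; the "." forces RET to be read as text.
csv_text <- "RET
0.005587
.
-0.005556"

ex_small <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

class(ex_small$RET)    # "factor"
typeof(ex_small$RET)   # "integer": the storage type of the level codes

# Wrong: returns the internal level codes, not the printed values.
codes <- as.double(ex_small$RET)

# Right: convert the labels to character first; "." becomes NA.
values <- suppressWarnings(as.numeric(as.character(ex_small$RET)))

codes
values
```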

Xu Wang
  • I had read this part of the help, but I honestly don't understand what this all means (I only started using R today). That column only has values that are either 0 or double, and there are no missing values. – Vivi Dec 05 '11 at 06:54
  • Ah, ok. What are the other columns in your dataset supposed to be? Do they import ok? Could you post the output of `sapply(ex,class)`. – Xu Wang Dec 05 '11 at 06:57
  • I added the information you requested to the end of my question – Vivi Dec 05 '11 at 07:01
  • I get an error: Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got '.' (there is also a parenthesis missing at the end of the command) – Vivi Dec 05 '11 at 07:10
  • ah, the '.' means that that's how missing values are denoted in the csv. `R` doesn't understand that so you have to tell it. – Xu Wang Dec 05 '11 at 07:11
  • And for the new_RET command I get: Warning message: NAs introduced by coercion (but it works, it converts it to double correctly!!! Yay!! Many thanks!) – Vivi Dec 05 '11 at 07:13
  • "NAs introduced by coercion" is normal. You have NA's. NA is kind of how R denotes missing values. If you look at the conversion pre and post, the "." were probably converted to NA. – Xu Wang Dec 05 '11 at 07:14
  • There is a parenthesis missing somewhere. When I add it to the end it still gives me the same error. I also checked the file and there are no missing values in that column (and I can't see a missing value anywhere else either!) – Vivi Dec 05 '11 at 07:19
  • Sorry, try the command again. The parenthesis needed to be added in the middle somewhere. – Xu Wang Dec 05 '11 at 07:23
  • (I found the problem. There are a few cells with the value as ".". You were right....) – Vivi Dec 05 '11 at 07:23
  • That's good because I thought I was crazy for a second. But don't delete them in the csv manually (even though I bet you're tempted to do so). The new command should work. – Xu Wang Dec 05 '11 at 07:25
  • Awesome, my pleasure! This was actually a very good (although painful) lesson in R. Also, I know the help files can be confusing, but they're really good. Try to read through them and when you get stuck, feel free to ask questions like "what does a factor mean" in R? Also, there are great books and free introductions out there. I would recommend working through one. Good luck! – Xu Wang Dec 05 '11 at 07:26
  • I think `typeof` is confusing you. `class(ex$RET)` might have gotten you to the answer sooner ... I think you don't even need `colClasses`, just the `na.strings` argument. `ex <- read.csv("exampleshort.csv",header=TRUE,na.strings=".")` seemed to work for me. – Ben Bolker Dec 05 '11 at 13:21
  • @BenBolker Good point, I bet you're right! I'll update the answer – Xu Wang Dec 05 '11 at 17:57
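
The hunt for the offending cells described in the comments can also be done from within R. A rough sketch, using invented values and a hypothetical helper named non_numeric_values (not part of base R), which lists the distinct entries of a column that fail numeric conversion:

```r
# Hypothetical helper: returns the distinct entries of x that are not
# missing but cannot be parsed as numbers.
non_numeric_values <- function(x) {
  chr <- as.character(x)
  num <- suppressWarnings(as.numeric(chr))
  unique(chr[is.na(num) & !is.na(chr)])
}

# Invented sample column with "." as the stray value.
ret <- factor(c("0.005587", ".", "-0.005556", ".", "0.005618"))
non_numeric_values(ret)
```

Applied to an imported data frame, something like `lapply(ex[sapply(ex, is.factor)], non_numeric_values)` would report such entries for every factor column.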