I apologize that I cannot provide a fully reproducible example (at least not one that follows the rules), but I still hope for help. I am using the American Housing Survey 2013 data.

Since the data files are quite big I would like to use the "fread" command instead of the "read.csv" command. With read.csv I could just do the following:

homimp <- read.csv("homimp.csv", quote = "'")
head(homimp)
       CONTROL RAS RAH  RAD JRAS JRAD
1 100003130103  74   2   96   -9    9
2 100006110249  35   2 8358   -9    9
3 100006110249  36   2 5970   -9    9
4 100006110249  37   2 6567   -9    9
5 100006110249  40   2  716   -9    9
6 100006110249  45   2 1910   -9    9

This removes the quotes (note that one column, RAD, is not quoted in the first place). However, if I read the file with fread, I do not seem to be able to remove the quotes. The quote argument returns an error:

homimpdt <- fread("homimp.csv", quote = "'")
Error in fread("homimp.csv", quote = "'") : unused argument (quote = "'")

And without the argument quotes are not removed:

homimpdt <- fread("homimp.csv")
head(homimpdt)
          CONTROL  RAS RAH  RAD JRAS JRAD
1: '100003130103' '74' '2'   96 '-9'  '9'
2: '100006110249' '35' '2' 8358 '-9'  '9'
3: '100006110249' '36' '2' 5970 '-9'  '9'
4: '100006110249' '37' '2' 6567 '-9'  '9'
5: '100006110249' '40' '2'  716 '-9'  '9'
6: '100006110249' '45' '2' 1910 '-9'  '9'

Why I want to do this:

> system.time(newhouse <- read.csv('newhouse.csv', quote = "'"))
   user  system elapsed 
  24.86    0.68   25.77 
> system.time(newhousedt <- fread('newhouse.csv'))
Read 84355 rows and 760 (of 760) columns from 0.273 GB file in 00:00:04
   user  system elapsed 
   3.33    0.07    3.41 

Thank you very much for your help!

Regarding Psidom's comment:

homimpdt <- fread("homimp.csv", quote = "\'")
Error in fread("homimp.csv", quote = "'") : unused argument (quote = "'")
Daniel Winkler
  • Set `quote = "\'"`. – Psidom Sep 18 '16 at 15:12
  • @Psidom: Thanks for your answer, but unfortunately it also returns an error. – Daniel Winkler Sep 18 '16 at 15:14
  • There is no `quote` parameter for `data.table::fread()` unless you use 1.9.7 from GitHub, but this file is ~67K lines. That's not a big file and really doesn't warrant the use of `fread()` for "speed". – hrbrmstr Sep 18 '16 at 15:15
  • Oh I forget that I am using `data.table` 1.9.7. – Psidom Sep 18 '16 at 15:15
  • @hrbrmstr Thanks for your answer! Is there anything I could use instead to remove the quotes, or is this simply not possible with fread? – Daniel Winkler Sep 18 '16 at 15:17
  • @Psidom: I am using 1.9.6 (this is what RStudio installed). So I guess the update will fix that? – Daniel Winkler Sep 18 '16 at 15:19
  • install the github version of `data.table` if you really want to but none of those CSV files are large. Even `person.csv` is small data. – hrbrmstr Sep 18 '16 at 15:19
  • You can update your data.table package; in my testing it works with `1.9.7`. It is still under development though, so use it with caution. – Psidom Sep 18 '16 at 15:21
  • Alright, thank you both! It would just save me some time. ;) (edit: I guess if it's still in dev I will wait for the update. I value the stability of the package more than my time.) By the way, is there any way I can give you credit for your answers? I understand that usually one can mark an answer as correct? – Daniel Winkler Sep 18 '16 at 15:22
  • @DanielWinkler installing 1.9.7 is pretty simple `install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")`, in case of any problems [here](https://github.com/Rdatatable/data.table/wiki/Installation) is full description. – jangorecki Sep 18 '16 at 15:43
  • You might try http://stackoverflow.com/questions/29499145/preventing-column-class-inference-in-fread/29499512#29499512 if you are on a linux-based system. Actually, the `type.convert` part (option 2) would probably work here as well. – Rich Scriven Sep 18 '16 at 15:57
  • @jangorecki Thank you very much for the instructions! Seems fairly straightforward! :) – Daniel Winkler Sep 18 '16 at 16:39
  • @RichScriven Thank you! I am currently on Windows 10 but I'll try when I get to my Linux PC (unfortunately I am sharing this code with Windows-only users). As for option two: as.numeric introduces NAs and type.convert makes the variable a factor but I will toy around a bit with that. – Daniel Winkler Sep 18 '16 at 16:50
  • No worries. `as.is = TRUE` in `type.convert()` will prevent factor coercion. – Rich Scriven Sep 18 '16 at 16:51
  • Now that it looks like you have a working solution, consider editing the question. Also, if none of the commenters does it, consider wrapping up the question with a summary of the comments above :) – abhiieor Sep 18 '16 at 17:22
  • Thank you @RichScriven and @abhiieor!! Is it possible to flag the question as solved even though there are no "answers" but only comments? Should I sum up the solution in the question or "answer" it with the solution? – Daniel Winkler Sep 19 '16 at 15:43
  • Glad to help! Yes, you could write up your own answer based on the information we helped you gather. It's perfectly acceptable to do that. – Rich Scriven Sep 19 '16 at 16:38

1 Answer

Summary of the answers given in comments:

Solution #1: Thanks to @Psidom and @jangorecki

Install data.table v. 1.9.7:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

Then run:

homimpdt <- fread("homimp.csv", quote = "\'")

EDIT: Current version of data.table on CRAN is 1.9.6

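Solution #2 (suggested for Linux-based systems): thanks to @RichScriven

See the answer to "Preventing column-class inference in fread()": http://stackoverflow.com/questions/29499145/preventing-column-class-inference-in-fread/29499512#29499512

According to the comments, the type.convert() part of that answer (option 2) would probably work on other systems as well. When using it, set as.is = TRUE in type.convert() so that the columns are not converted to factors.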
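
For readers who stay on data.table 1.9.6, here is a minimal base-R sketch of the post-processing idea from the comments: strip the single quotes from the character columns and then call type.convert() with as.is = TRUE. The helper name strip_quotes and the small hand-built data frame are assumptions made for illustration (the data frame stands in for what fread() returned in the question); this is not the code from the linked answer.

```r
# Hypothetical helper: remove a leading/trailing quote character from every
# character column, then let type.convert() choose a type without factors.
strip_quotes <- function(df, q = "'") {
  pattern <- paste0("^", q, "|", q, "$")
  df[] <- lapply(df, function(col) {
    if (!is.character(col)) return(col)  # e.g. RAD is already numeric
    type.convert(gsub(pattern, "", col), as.is = TRUE)
  })
  df
}

# Stand-in for the quoted output shown in the question (first two rows).
quoted <- data.frame(
  CONTROL = c("'100003130103'", "'100006110249'"),
  RAS     = c("'74'", "'35'"),
  RAD     = c(96L, 8358L),
  JRAS    = c("'-9'", "'-9'"),
  stringsAsFactors = FALSE
)

clean <- strip_quotes(quoted)
str(clean)
```

The same helper should also accept the object returned by fread(), since it only relies on lapply() over the columns, but that has not been timed here against read.csv().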

Daniel Winkler