
I have a 5 GB file (19M rows and 16 columns) that I am working on. One of the fields in this file is in YearQtr format, for example `2014Q1`. Following the "Extract year from date" thread, I extract the year information as:

library(zoo)
x <- "2014Q1"
# parse to yearqtr, format out just the year, then convert to factor
d <- as.factor(format(as.yearqtr(x), "%Y"))

While this works, I have about 19M rows, so it takes forever for RStudio to process. For instance, `fread` reads the file in about 45 seconds, but extracting the year takes 10 minutes! Is there any way I can make this faster? I even tried `as.Date()`, but there was no improvement. I'd appreciate any thoughts.
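
For reference, here is a reproducible sketch with randomly generated dummy data standing in for the real file (the object names, year range and row count below are made up):

library(zoo)
set.seed(1)
n  <- 1e6   # scale towards 19e6 to reproduce the full-size problem
yq <- paste0(sample(2000:2015, n, replace = TRUE), "Q", sample(1:4, n, replace = TRUE))

# current approach: parse to yearqtr, format out the year, then convert to factor
system.time(d <- as.factor(format(as.yearqtr(yq), "%Y")))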

  • `substr(date,1,4)`? (never name your variables after a function) – Jaap Sep 28 '16 at 22:01
  • @ProcrastinatusMaximus – I ran `substr(date,1,4)` and it was fast (about a minute), but converting the year to a `factor` takes a lot of time. Do you know whether there is any way I can speed up creating factors? Thanks in advance for your help. – watchtower Sep 28 '16 at 22:26
  • For 100 million observations, I get a time of about 3 seconds: `x = sample(as.character(1900:2000), 1e8, replace = TRUE); system.time(as.factor(x))`. By the way, if you're using data.table, it is possibly more efficient to use `substr` by groups, since you will have many more observations than years (see the sketch after these comments). – Frank Sep 28 '16 at 22:31
  • @Frank This is interesting. I'm not sure what's going on, but I saved my data to .RData and the total size is 406 MB on disk (after clearing all variables and keeping only the data frame). When I ran your query in a fresh session, it also finished in about 4 seconds, same as yours. Do you need any information about my system? – watchtower Sep 28 '16 at 22:45
  • Hm, I don't think I need system info, nor that I'd be savvy enough to know what to do with it :) If you can find a way to build a reproducible example, though, you could edit it into the question. – Frank Sep 28 '16 at 22:47
  • You could provide random dummy data of your size; it would help us address your problem and provide an answer. You may also want to check the internal (and local) data.table function [`as_factor`](https://github.com/Rdatatable/data.table/blob/77956204cc267c3b0581db9db6b970b9a19d7253/R/fread.R#L157), which uses data.table's fast ordering to create the factor; it should speed up your code, but you need to copy it and use `data.table:::forderv`. Feel free to self-answer your question with the code and, if possible, timings. – jangorecki Sep 28 '16 at 23:09
  • @jangorecki Sure, I can provide dummy data. I am an absolute beginner in R and have only been using it "actively" for about 3 weeks. Do you mind pointing me to how I can generate dummy data? I googled "dummy data in R" and found a bunch of books for dummies. I'd appreciate your thoughts. – watchtower Sep 29 '16 at 00:40
  • For making dummy data, see [*"How to make a great R reproducible example"*](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). – Jaap Sep 29 '16 at 07:55
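
A minimal sketch of the `substr`-based route suggested in the comments above, assuming (hypothetically) the data is already in a data.table named `dt` with the YearQtr column called `yq`; both names, and the dummy data, are made up for illustration:

library(data.table)
set.seed(1)

# hypothetical stand-in data
dt <- data.table(
  yq = paste0(sample(2000:2015, 1e6, replace = TRUE), "Q", sample(1:4, 1e6, replace = TRUE))
)

# plain substr(): "2014Q1" -> "2014", no date parsing at all
dt[, year := as.factor(substr(yq, 1, 4))]

# the by-group variant from the comments: evaluate substr() once per distinct
# quarter value (a few dozen groups vs. millions of rows), then factor once
dt[, year := substr(yq[1L], 1, 4), by = yq]
dt[, year := as.factor(year)]

Either way the expensive `as.yearqtr()` parsing is skipped entirely; only the character-to-factor conversion remains, which Frank's timing above suggests is fast.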

0 Answers