0

I have 4190145 observations. And I want to change my date to the end day of the month. I explain in the following.

Here is my part of data:

Time1
2015/01/15
2015/02/24
2015/07/18
2015/11/10
2016/02/20
2016/04/26
2016/08/17

I want to create new column Time2:

Time1           Time2
2015/01/15      2015/01/31
2015/02/24      2015/02/28
2015/07/18      2015/07/31
2015/11/10      2015/11/30
2016/02/20      2016/02/29
2016/04/26      2016/04/30
2016/08/17      2016/08/31

The code is:

data[, Time2 := Time1]
day(data$Time2) <- days_in_month(data$Time1)

However, I got the error.

Error: cannot allocate vector of size N Mb

Therefore, I search my problem on Stack Overflow and find this.

I use gc() but still not work. So I see my sessionInfo():

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lubridate_1.6.0   data.table_1.10.4

loaded via a namespace (and not attached):
 [1] lazyeval_0.2.0 R6_2.2.0       assertthat_0.1 magrittr_1.5   DBI_0.5-1      tools_3.3.3    dplyr_0.5.0    tibble_1.2    
 [9] Rcpp_0.12.9    stringi_1.1.2  stringr_1.1.0 

I think it is because my computer just 32-bit. And memory.limit() can just set size to 4000 because my RAM is 4 GB.

Moreover, I figure out that if I use data.table to execute, I can run.

So, my problem is how to change my code from day(data$Time2) <- days_in_month(data$Time1) to data.table form.

Maybe like data[, day(data$Time2) := days_in_month(data$Time1)] ?

I know this is incorrect because I get

Error: Can't assign to the same column twice in the same query (duplicates detected).

Any suggestion?

Uwe
  • 41,420
  • 11
  • 90
  • 134
Peter Chen
  • 1,464
  • 3
  • 21
  • 48
  • 2
    maybe you can split your data then run your original code and then `rbind` these separate data –  May 23 '17 at 02:29
  • 1
    What about `dt[,Time2:=as.Date(ISOdate(year(Time1), month(Time1), lubridate::days_in_month(Time1)))]`? – HubertL May 23 '17 at 02:39
  • 1
    First step would be to use a date class, not POSIXlt or whatever is allowing you to use `/` and `day<-`; second, to prioritize making a reproducible example over posting session info and hardware details... – Frank May 23 '17 at 03:49
  • See also https://stackoverflow.com/questions/1395229/increasing-or-decreasing-the-memory-available-to-r-processes#comment1236220_1395256 – MichaelChirico May 23 '17 at 03:53

2 Answers2

2

Perhaps this works?

set.seed(120340)
NN = 5e6
DT = data.table(Time1 = 
                  sprintf('%04d/%02d/%02d',
                          sample(2000:2017, NN, TRUE),
                          sample(12, NN, TRUE),
                          sample(28, NN, TRUE)))

# potential memory bottleneck
DT[ , c('y', 'm', 'd') := tstrsplit(Time1, '/')]

days_month = data.table(
  month = sprintf('%02d', 1:12),
  days = c(31L, 28L, 31L, 30L, 31L, 30L, 
           31L, 31L, 30L, 31L, 30L, 31L)
)

DT[days_month, d_end := i.days, on = c(m = 'month')]
DT[m == 2L & as.integer(y) %% 4L == 0L, d_end := 29L]
DT[ , Time2 := do.call(paste, c(.SD, list(sep = '/'))), 
    .SDcols = c('y', 'm', 'd_end')]

If that fails, I guess this will be less memory-intensive:

DT[ , y := gsub('/.*', '', Time1)]
DT[ , c('m', 'd') := tstrsplit(Time1, '/')[2L:3L], by = y]

And if that fails, I suggest buying more RAM, or working with a subset of your data before deploying to a more serious machine. Also, as Frank points out, you really want to be storing these as IDates for maximal memory efficiency.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • If they're facing a reasonably small date range (like spanning under 1000 years), they could enumerate the end-o-months and use a rolling join, maybe... assuming they actually used dates instead of strings. – Frank May 23 '17 at 03:58
  • @Frank yea, I thought of that. I figured this way is similar in terms of memory requirements and speed, but easier to see the logic. But joining my `days_month` here to `CJ(yr = unique(DT$y), mo = 1:12)` is another option to try. – MichaelChirico May 23 '17 at 04:00
  • Yeah, I guess it's just perspective. As someone who hates working with dates / learning their minutiae, I find rolling on `seq(as.IDate("1900-01-01"), as.IDate("2100-01-01"), by="month")-1L` easier to follow. – Frank May 23 '17 at 04:07
  • 1
    @Frank ah gotcha. I had something slightly different in mind. Actually 1900 is the one year in there that's annoying because of `as.Date('1900-03-01') - 1L` being unexpected – MichaelChirico May 23 '17 at 04:10
  • 1
    actually, `data.table` still has limit. If data is very big, it gets error as well. –  May 23 '17 at 06:32
0

For the sake of completeness, here is what I would do using data.table and lubridate:

# create sample data
library(data.table)
set.seed(120340)
NN <- 1e6
DT <- data.table(Time1 = sprintf('%04d/%02d/%02d',
                                 sample(2000:2017, NN, TRUE),
                                 sample(12, NN, TRUE),
                                 sample(28, NN, TRUE)))

Using ceiling_date()

library(lubridate)
DT[, Time2 := ceiling_date(ymd(Time1), "month") - 1]
DT
              Time1      Time2
      1: 2005/04/14 2005-04-30
      2: 2007/01/11 2007-01-31
      3: 2014/09/08 2014-09-30
      4: 2017/05/13 2017-05-31
      5: 2008/05/23 2008-05-31
     ---                      
 999996: 2003/06/08 2003-06-30
 999997: 2004/04/12 2004-04-30
 999998: 2009/06/10 2009-06-30
 999999: 2013/02/04 2013-02-28
1000000: 2014/03/05 2014-03-31
tables()
     NAME      NROW NCOL MB COLS        KEY
[1,] DT   1,000,000    2 16 Time1,Time2    
Total: 16MB

Using data.table's assignment operator := updates DT by reference, i.e., without copying the whole data object.

ceiling_date() takes a date-time object and rounds it up to the nearest boundary of the specified time unit, i.e., to the first day of the next month. Therefore, we have to subtract 1 day to get the last day of the actual month.

Reducing memory footprint with as.IDate()

data.table has its own date and time classes with integer storage for fast sorting and grouping. Integer storage may require less memory than double on some systems:

DT[, Time1 := as.IDate(Time1, "%Y/%m/%d")]
DT[, Time2 := as.IDate(ceiling_date(ymd(Time1), "month") - 1)]
tables()
     NAME      NROW NCOL MB COLS        KEY
[1,] DT   1,000,000    2  8 Time1,Time2    
Total: 8MB

Here, only half the memory is required.

Uwe
  • 41,420
  • 11
  • 90
  • 134