Change date to the end day of the month by data.table

Question

I have 4190145 observations. And I want to change my date to the end day of the month. I explain in the following.

Here is my part of data:

Time1
2015/01/15
2015/02/24
2015/07/18
2015/11/10
2016/02/20
2016/04/26
2016/08/17

I want to create new column Time2:

Time1           Time2
2015/01/15      2015/01/31
2015/02/24      2015/02/28
2015/07/18      2015/07/31
2015/11/10      2015/11/30
2016/02/20      2016/02/29
2016/04/26      2016/04/30
2016/08/17      2016/08/31

The code is:

data[, Time2 := Time1]
day(data$Time2) <- days_in_month(data$Time1)

However, I got the error.

Error: cannot allocate vector of size N Mb

Therefore, I search my problem on Stack Overflow and find this.

I use gc() but still not work. So I see my sessionInfo():

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lubridate_1.6.0   data.table_1.10.4

loaded via a namespace (and not attached):
 [1] lazyeval_0.2.0 R6_2.2.0       assertthat_0.1 magrittr_1.5   DBI_0.5-1      tools_3.3.3    dplyr_0.5.0    tibble_1.2    
 [9] Rcpp_0.12.9    stringi_1.1.2  stringr_1.1.0

I think it is because my computer just 32-bit. And memory.limit() can just set size to 4000 because my RAM is 4 GB.

Moreover, I figure out that if I use data.table to execute, I can run.

So, my problem is how to change my code from day(data$Time2) <- days_in_month(data$Time1) to data.table form.

Maybe like data[, day(data$Time2) := days_in_month(data$Time1)] ?

I know this is incorrect because I get

Error: Can't assign to the same column twice in the same query (duplicates detected).

Any suggestion?

maybe you can split your data then run your original code and then `rbind` these separate data — , May 23 '17 at 02:29
What about `dt[,Time2:=as.Date(ISOdate(year(Time1), month(Time1), lubridate::days_in_month(Time1)))]`? — HubertL, May 23 '17 at 02:39
First step would be to use a date class, not POSIXlt or whatever is allowing you to use `/` and `day<-`; second, to prioritize making a reproducible example over posting session info and hardware details... — Frank, May 23 '17 at 03:49
See also https://stackoverflow.com/questions/1395229/increasing-or-decreasing-the-memory-available-to-r-processes#comment1236220_1395256 — MichaelChirico, May 23 '17 at 03:53

MichaelChirico · Accepted Answer · 2017-05-23T03:54:15.423

2

Perhaps this works?

set.seed(120340)
NN = 5e6
DT = data.table(Time1 = 
                  sprintf('%04d/%02d/%02d',
                          sample(2000:2017, NN, TRUE),
                          sample(12, NN, TRUE),
                          sample(28, NN, TRUE)))

# potential memory bottleneck
DT[ , c('y', 'm', 'd') := tstrsplit(Time1, '/')]

days_month = data.table(
  month = sprintf('%02d', 1:12),
  days = c(31L, 28L, 31L, 30L, 31L, 30L, 
           31L, 31L, 30L, 31L, 30L, 31L)
)

DT[days_month, d_end := i.days, on = c(m = 'month')]
DT[m == 2L & as.integer(y) %% 4L == 0L, d_end := 29L]
DT[ , Time2 := do.call(paste, c(.SD, list(sep = '/'))), 
    .SDcols = c('y', 'm', 'd_end')]

If that fails, I guess this will be less memory-intensive:

DT[ , y := gsub('/.*', '', Time1)]
DT[ , c('m', 'd') := tstrsplit(Time1, '/')[2L:3L], by = y]

And if that fails, I suggest buying more RAM, or working with a subset of your data before deploying to a more serious machine. Also, as Frank points out, you really want to be storing these as IDates for maximal memory efficiency.

edited May 23 '17 at 03:54

answered May 23 '17 at 03:48

MichaelChirico

33,841
14
113
198

If they're facing a reasonably small date range (like spanning under 1000 years), they could enumerate the end-o-months and use a rolling join, maybe... assuming they actually used dates instead of strings. – Frank May 23 '17 at 03:58
@Frank yea, I thought of that. I figured this way is similar in terms of memory requirements and speed, but easier to see the logic. But joining my `days_month` here to `CJ(yr = unique(DT$y), mo = 1:12)` is another option to try. – MichaelChirico May 23 '17 at 04:00
Yeah, I guess it's just perspective. As someone who hates working with dates / learning their minutiae, I find rolling on `seq(as.IDate("1900-01-01"), as.IDate("2100-01-01"), by="month")-1L` easier to follow. – Frank May 23 '17 at 04:07
1

@Frank ah gotcha. I had something slightly different in mind. Actually 1900 is the one year in there that's annoying because of `as.Date('1900-03-01') - 1L` being unexpected – MichaelChirico May 23 '17 at 04:10
1

actually, `data.table` still has limit. If data is very big, it gets error as well. – May 23 '17 at 06:32

score 0 · Answer 2 · answered Jan 12 '18 at 16:24

For the sake of completeness, here is what I would do using data.table and lubridate:

# create sample data
library(data.table)
set.seed(120340)
NN <- 1e6
DT <- data.table(Time1 = sprintf('%04d/%02d/%02d',
                                 sample(2000:2017, NN, TRUE),
                                 sample(12, NN, TRUE),
                                 sample(28, NN, TRUE)))

Using `ceiling_date()`

library(lubridate)
DT[, Time2 := ceiling_date(ymd(Time1), "month") - 1]
DT

              Time1      Time2
      1: 2005/04/14 2005-04-30
      2: 2007/01/11 2007-01-31
      3: 2014/09/08 2014-09-30
      4: 2017/05/13 2017-05-31
      5: 2008/05/23 2008-05-31
     ---                      
 999996: 2003/06/08 2003-06-30
 999997: 2004/04/12 2004-04-30
 999998: 2009/06/10 2009-06-30
 999999: 2013/02/04 2013-02-28
1000000: 2014/03/05 2014-03-31

tables()

     NAME      NROW NCOL MB COLS        KEY
[1,] DT   1,000,000    2 16 Time1,Time2    
Total: 16MB

Using data.table's assignment operator := updates DT by reference, i.e., without copying the whole data object.

ceiling_date() takes a date-time object and rounds it up to the nearest boundary of the specified time unit, i.e., to the first day of the next month. Therefore, we have to subtract 1 day to get the last day of the actual month.

Reducing memory footprint with `as.IDate()`

data.table has its own date and time classes with integer storage for fast sorting and grouping. Integer storage may require less memory than double on some systems:

DT[, Time1 := as.IDate(Time1, "%Y/%m/%d")]
DT[, Time2 := as.IDate(ceiling_date(ymd(Time1), "month") - 1)]
tables()

     NAME      NROW NCOL MB COLS        KEY
[1,] DT   1,000,000    2  8 Time1,Time2    
Total: 8MB

Here, only half the memory is required.

Change date to the end day of the month by data.table

2 Answers2

Using ceiling_date()

Reducing memory footprint with as.IDate()

Using `ceiling_date()`

Reducing memory footprint with `as.IDate()`