
I have a data frame with 4 million rows and 1.4 million distinct values of a grouping variable. A sample of the data frame looks like this:

> df
        date        id
1 2015-06-25   4333864
2 2015-06-25   3867895
3 2015-06-25   4333866
4 2015-06-25   4333868
5 2015-06-29   2900522
6 2015-06-29   3609093

Using this command to compute lagged date differences crashes R on a Mac with 8 GB of memory:

df %>% group_by(id) %>% mutate(dayDiff = date - lag(date))

Is dplyr being memory-hungry here? Is there a more efficient way to accomplish what I need?
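For reference, here is a sketch of the same per-group lagged difference using data.table instead of dplyr. This is an untested assumption on my part for data of this size; the toy data below is made up to mirror the structure of df, and it uses data.table's shift(), which requires a reasonably recent data.table version:

```r
library(data.table)

# Toy data with the same structure as df (made-up values)
dt <- data.table(
  date = as.Date(c("2015-06-25", "2015-06-29", "2015-06-25", "2015-07-02")),
  id   = c(1793096, 1793096, 2019424, 2019424)
)

# Sort by id and date, then take the within-group lagged difference.
# data.table computes by-group expressions in C, which may avoid the
# per-group overhead that makes this slow with 1.4 million groups.
setkey(dt, id, date)
dt[, dayDiff := date - shift(date), by = id]
```

The result is a difftime column with NA for the first row of each id, matching what mutate(dayDiff = date - lag(date)) would produce.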

Here is the version of dplyr I am using:

Package: dplyr
Type: Package
Version: 0.4.1

The data frame has the following variable types:

> str(df)
'data.frame':   6 obs. of  2 variables:
 $ date: Date, format: "2014-07-01" "2014-07-01" "2014-07-01" ...
 $ id  : num  1793096 2019424 1869572 1869573 1774661 ...