How to reference the current line while creating a new variable in a datatable in R

Question

I think I have been unconsciously roaming around this question for some time.

Let me try to explain it clearly. Suppose I have a datatable such as this one :

library(tidyverse)
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
  .[,c("pce","psavert","uempmed","unemploy"):=NULL]

> MWE
           date      pop
  1: 1967-07-01 198712.0
  2: 1967-08-01 198911.0
  3: 1967-09-01 199113.0
  4: 1967-10-01 199311.0
  5: 1967-11-01 199498.0
 ---                    
570: 2014-12-01 319746.2
571: 2015-01-01 319928.6
572: 2015-02-01 320074.5
573: 2015-03-01 320230.8
574: 2015-04-01 320402.3

There are several things I want to do that imply creating new variables that for a give row i depends on other variables at this row.

For instance, if I want the evolution of population for one month to another, what I want is something like :

MWE2 <- MWE[, MoM :=pop[date=i$date]/pop[date=(i$date-month(1))]

The same if I want a Year on year evolution :

MWE3 <- MWE[, YoY :=pop[date=i$date]/pop[date=(i$date-year(1))]

And if for instance I want to create the difference between the current value and the average of the population for the same month for the 5 previous years, it would be someting like

MWE4 <- MWE[, 5YD :=pop- mean(pop[month(date)=month(i$date) & (year(date$i)-6 <= year(date) < year(date$i))])]

So is there a way to do something close to what I envision ? Do not hesitate to tell me if I have been unclear

Mike · Answer 1 · 2020-11-24T15:51:01.753

1

The code chunk below will get you your data sets: You can use mostly dplyrs lag function for the MoM and YoY variables. MWE2 is a way to get the month over month ration, MWE3 gets the YoY ratio and the last 3 data sets get the 5 year lag and get the difference in the population. The 5 year code can be changed for whatever year you need it to be.

library(tidyverse)
    library(data.table)
    library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
  .[,c("pce","psavert","uempmed","unemploy"):=NULL]
MWE2 <- MWE %>% 
        arrange(date) %>% 
        mutate(
               lagpop = lag(pop),
               MoM = pop/lagpop)



MWE3 <- MWE %>% 
  mutate(yeard = year(date)) %>% 
  group_by(yeard) %>% 
  summarise(avgpop = mean(pop,na.rm = TRUE)) %>% 
  ungroup %>% 
  arrange(yeard) %>% 
  mutate(lagpop = lag(avgpop),
         YoY = avgpop/lagpop)


MWEa <- mutate(MWE, date5 = date+(365*5)+2 ,
               yeard = year(date5),
               monthd = month(date5))

MWE <- mutate(MWE, 
              yeard = year(date),
              monthd = month(date))

MWE4 <- left_join(MWE,MWEa , by = c("yeard","monthd")) %>% 
        mutate(diff5yr = pop.x - pop.y)

edited Nov 24 '20 at 15:51

answered Nov 24 '20 at 15:36

Mike

3,797
1
11
30

I see two (minor) downsides with the use of lag : one is that I get values from firt date of each group that does not mean anything, second is that I find it a bit more complicated to be sure to order properly when I have a lot more variables. I am not with R anymore, but I believe `MWE3` calculates the average over the whole year, whereas I would have liked to have a montly growh from the same month the year before – Anthony Martin Nov 24 '20 at 16:07
true, yea the the first value of the group will be NA but the value can be changed. `MWE3` would calculate for the whole year, did you want a rolling year average? did you move this out of `R` and into another software? – Mike Nov 24 '20 at 17:13
I was referring to the fact that I have been on a computer without R installed, not that I have changed programming language My point, but I might be mistaken, is precisely that the first values for each group (In my real data I have several groups) will not be `NA` but will have a nonsensical value. And yes I guess I meant rolling year average – Anthony Martin Nov 24 '20 at 19:20
ok no problem, for the first point above if you add a `group_by(your_group)` before the arrange that makes `MWE2` the first group value will be NA. here is a good resource on rolling averages: https://stackoverflow.com/questions/26198551/rolling-mean-moving-average-by-group-id-with-dplyr – Mike Nov 25 '20 at 13:51

score 1 · Answer 2 · answered Nov 24 '20 at 15:38

Back to basics.

Crude solution, however it is straightforward.

library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics)
MWE <- MWE[,c("pce","psavert","uempmed","unemploy"):=NULL]
data = data.frame(MWE)
lubridate::month(MWE$date[1])


monthly_diffs = double()
yearly_diffs = double()
current_year_months = 0
current_year_sum = 0
past_year_sum = 0
past_year_months = 0


for (i in 1:(nrow(data)-1)) {
  #since your data appears to be on a monthly basis, we can do the difference directly
  monthly_diffs[i] <- data$pop[i + 1] / data$pop[i]
  #for the years there is a need to accumulate it first.
  if (current_year_months == 0) {
    current_year <- year(MWE$date[i])
  }
  
  if (year(MWE$date[i]) == current_year) {
    #if we are within one year, we accumulate
    current_year_months <- current_year_months + 1
    current_year_sum <- current_year_sum + MWE$pop[i]
  } else {
    #otherwise we compute the averages, the ratio and save it
    if (past_year_months != 0) {
       #I will assume you want the average difference over the year
       #as I find it a reasonable approach to a yearly unemployment rates.
       #any other operation would be similar.
       current_year_average <- current_year_sum / current_year_months
       past_year_average <- past_year_sum / past_year_months 
       yearly_diffs = c(yearly_diffs, current_year_average / past_year_average)
       
    } 
    past_year_months = current_year_months
    past_year_sum = current_year_sum
    current_year_months = 0
    current_year_sum = 0
    current_year = year(MWE$date[i])
  }
}

How to reference the current line while creating a new variable in a datatable in R

2 Answers2

Back to basics.