2

Apparently the update() method cannot retrieve the dataset the estimation was based on if I wrap the estimation function in another function. Is there any way around this, e.g., by specifying an environment?

library(fixest)
data(trade)

# fit model directly and wrapped into function
mod1 <- fepois(Euros ~ log(dist_km) | Origin + Destination, trade)

fit_model <- function(df) {
  fepois(Euros ~ log(dist_km) | Origin + Destination, data = df)
}

mod2 <- fit_model(trade)

# try to update
update(mod1, . ~ . + log(Year))
#> Poisson estimation, Dep. Var.: Euros
#> Observations: 38,325 
#> Fixed-effects: Origin: 15,  Destination: 15
#> Standard-errors: Clustered (Origin) 
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> log(dist_km) -1.51756   0.113171 -13.4095 < 2.2e-16 ***
#> log(Year)    72.36888   6.899699  10.4887 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -1.212e+12   Adj. Pseudo R2: 0.592897
#>            BIC:  2.424e+12     Squared Cor.: 0.384441
update(mod2, . ~ . + log(Year))
#> Error in fepois(fml = Euros ~ log(dist_km) + log(Year) | Origin + Destination, : Argument 'data' must be either: i) a matrix, or ii) a data.frame.
#> Problem: it is not a matrix nor a data.frame (instead it is a function).

Created on 2023-02-26 with reprex v2.0.2

Also posted as a GitHub issue.

Update: The solution seems to be forcing an early evaluation of the expression that refers to the dataset. Another way is to specify the dataset again within update():

update(mod2, . ~ . + log(Year), data = trade)
dufei
  • 2,166
  • 1
  • 7
  • 18
  • Thanks for your replies! I'll try to understand the issue better before I pick an answer - perhaps the community has a preference, too. – dufei Mar 03 '23 at 11:08

3 Answers3

2

If you want to pass an arbitrary df into the function and not hard code trade we have to evaluate it early before calling fepois(). We can do this with eval(bquote()) and wrap the data argument (below mydat) into .(). To capture the object name nicely, we can further wrap the data argument in substitute() before evaluating it early (thanks for the comment from @jay.sf).

Update: I now added an env argument which needs to be specified with parent.frame() when used inside purrr::map() and similar functions.

library(fixest)
library(tidyverse)
data(trade)

fit_model <- function(mydat, env = environment()) {
  eval(bquote(fepois(Euros ~ log(dist_km) | Origin + Destination, data = .(substitute(mydat, env = env)))))
}

mod2 <- fit_model(trade)

update(mod2, . ~ . + log(Year))
#> Poisson estimation, Dep. Var.: Euros
#> Observations: 38,325 
#> Fixed-effects: Origin: 15,  Destination: 15
#> Standard-errors: Clustered (Origin) 
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> log(dist_km) -1.51756   0.113171 -13.4095 < 2.2e-16 ***
#> log(Year)    72.36888   6.899699  10.4887 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -1.212e+12   Adj. Pseudo R2: 0.592897
#>            BIC:  2.424e+12     Squared Cor.: 0.384441

mod2$call
#> fepois(fml = Euros ~ log(dist_km) | Origin + Destination, data = trade)

res <- trade |>
  nest(.by = Year) |>
  mutate(fit = map(data, \(x) fit_model(x, parent.frame())))

res$fit[[1]]
#> Poisson estimation, Dep. Var.: Euros
#> Observations: 3,793 
#> Fixed-effects: Origin: 15,  Destination: 15
#> Standard-errors: Clustered (Origin) 
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> log(dist_km) -1.48073   0.114878 -12.8896 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -1.082e+11   Adj. Pseudo R2: 0.573982
#>            BIC:  2.164e+11     Squared Cor.: 0.352497

res$fit[[1]]$call
#> fepois(fml = Euros ~ log(dist_km) | Origin + Destination, data = mydat)

Created on 2023-03-07 by the reprex package (v2.0.1)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • Thank you so much! I was actually trying to make the function work within `map()` on a nested dataframe. However, when I try to run `trade |> nest(.by = Year) |> mutate(fit = map(data, fit_model))` I get the error "Argument 'data' could not be evaluated. Problem: object '.x' not found." so it seems like there is yet another issue with when the data argument gets evaluated? Much appreciated if you find the time to have another look! – dufei Mar 07 '23 at 14:07
  • @dufei: See my update, I added an `env` argument which needs to be specified in functions like `map()`. – TimTeaFan Mar 07 '23 at 15:12
  • Great, thanks for the swift update! Do you have any pointers on where to read up on the topic of environments and evaluation? I guess Hadley Wickham's [Advanced R](https://adv-r.hadley.nz/environments.html) should be a good place to start? – dufei Mar 07 '23 at 15:24
  • @dufei: Yes exactly, thats a good start, but also not easy. In the end playing around with functions and environments helped me a lot to get a better understanding (which is still far from perfect). – TimTeaFan Mar 07 '23 at 15:33
1

The problem is, that the call looks like

mod2$call
# fepois(fml = Euros ~ log(dist_km) | Origin + Destination, data = df)

where data should be data = trade.

You could use an eval-parse approach. A little hacky, but works.

fit_model2 <- function(df) {
  eval(parse(text=sprintf('fepois(Euros ~ log(dist_km) | Origin + Destination, data = %s)', 
                          deparse(substitute(df)))))
}

mod2a <- fit_model2(trade)
mod2a$call
# fepois(fml = Euros ~ log(dist_km) | Origin + Destination, data = trade)

update(mod2a, . ~ . + log(Year))
# Poisson estimation, Dep. Var.: Euros
# Observations: 38,325 
# Fixed-effects: Origin: 15,  Destination: 15
# Standard-errors: Clustered (Origin) 
# Estimate Std. Error  t value  Pr(>|t|)    
# log(dist_km) -1.51756   0.113171 -13.4095 < 2.2e-16 ***
# log(Year)    72.36888   6.899699  10.4887 < 2.2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Log-Likelihood: -1.212e+12   Adj. Pseudo R2: 0.592897
#            BIC:  2.424e+12     Squared Cor.: 0.384441
jay.sf
  • 60,139
  • 8
  • 53
  • 110
0

Try to replace df with trade in the fit_model function as the fepois doesnt recognize the df data like this :

fit_model <- function(trade) {
  fepois(Euros ~ log(dist_km) | Origin + Destination, data = trade)
}

mod2 <- fit_model(trade)

update(mod2, . ~ . + log(Year))

Poisson estimation, Dep. Var.: Euros
Observations: 38,325 
Fixed-effects: Origin: 15,  Destination: 15
Standard-errors: Clustered (Origin) 
             Estimate Std. Error  t value  Pr(>|t|)    
log(dist_km) -1.51756   0.113171 -13.4095 < 2.2e-16 ***
log(Year)    72.36888   6.899699  10.4887 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -1.212e+12   Adj. Pseudo R2: 0.592897
           BIC:  2.424e+12     Squared Cor.: 0.384441
S-SHAAF
  • 1,863
  • 2
  • 5
  • 14
  • Thanks a lot! Just changing a "transitory" variable that lives inside a function feels like it shouldn't make a difference but it really does. Do you have an explanation? I assume it's because the model only contains a reference to the dataset name that is not evaluated? I figured that I can also just provide the dataset name again in `update()` which feels like a solution that is slightly more robust. – dufei Mar 03 '23 at 10:58
  • Yes, one solution to this issue is to explicitly pass the data frame as an argument to the update function, like this:update(mod2, formula = . ~ . + log(Year), data = trade). Or you could modify your fit_model function to include the data argument. – S-SHAAF Mar 03 '23 at 12:55