I am looking for the code of the base R formula parser or interpreter, that translates the formula the user types into the variables and transformations used to bridge the data to the model matrix. A number of packages have their own formula interpreters that supplement or replace the base R interpreter, e.g. rmutil, gamlss.nl, ttBulk.

At a minimum, the following symbols have a distinct meaning in the formula context. I am looking for the code that implements that meaning.

~, 1, 0, +, -, *, /, :, ^, ., |, I, %in%
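One way to see what several of these operators do, without reading the implementation, is to ask `terms()` how it expands a formula. The following is a minimal sketch; the variable names `a` and `b` and the helper `expand` are placeholders introduced here, and no data is needed because `terms()` does not evaluate the variables.

```r
# Hypothetical helper: return the expanded term labels of a formula.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ a * b)        # "a" "b" "a:b"  (crossing: main effects plus interaction)
expand(y ~ a / b)        # "a" "a:b"      (nesting: b within a)
expand(y ~ (a + b)^2)    # "a" "b" "a:b"  (all terms up to second order)
expand(y ~ a * b - a:b)  # "a" "b"        (minus removes a term)

# 0 (or -1) drops the intercept; this is recorded as an attribute, not a label.
attr(terms(y ~ 0 + a), "intercept")  # 0
attr(terms(y ~ a), "intercept")      # 1
```

This only shows the result of the expansion, not the code that performs it.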

In addition, the functions below seem to be used mainly within the formula context, but I am not sure if they operate in a distinct way in that context. Some may have meaning only in model-fitting functions beyond lm or from particular packages. In some cases I am not sure that they have a meaning outside of the formula context.

C
||
poly
offset
strata
cluster
contrasts
ns
lo
bs
s

What I really want is an expository piece or tutorial at a level of detail that would let me figure out, e.g. which of the operations above commute, which are distributive over which other ones, which ones have inverse operations. But I gather that no such exposition exists.
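Absent such an exposition, questions like "does `*` commute?" can at least be probed empirically. The sketch below compares two formulas by the set of variables in each expanded term, read from the `factors` attribute of the `terms` object. The helpers `canon` and `same_terms` are hypothetical names introduced here. Note the assumption: this compares only which terms appear, not how each term is coded in the model matrix, so it is a necessary check rather than a proof of equivalence.

```r
# Hypothetical helper: canonical, order-independent description of the terms.
canon <- function(f) {
  fac  <- attr(terms(f), "factors")   # variables (rows) by terms (columns)
  vars <- rownames(fac)
  labs <- vapply(
    seq_len(ncol(fac)),
    function(j) paste(sort(vars[fac[, j] > 0]), collapse = ":"),
    character(1)
  )
  sort(labs)
}

# Two formulas "match" if they expand to the same set of terms.
same_terms <- function(f, g) identical(canon(f), canon(g))

same_terms(y ~ a * b, y ~ b * a)          # TRUE:  crossing commutes
same_terms(y ~ (a + b):c, y ~ a:c + b:c)  # TRUE:  ":" distributes over "+"
same_terms(y ~ a / b, y ~ b / a)          # FALSE: nesting does not commute
same_terms(y ~ a + b - b, y ~ a)          # TRUE:  "-" undoes "+"
```

The raw term labels are not enough for this comparison, because `y ~ b * a` labels its interaction `b:a` rather than `a:b`; hence the sorting inside `canon`.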

I'd also like to get a complete list of functions that mean something different inside a formula, if such can be had. There is nothing in the R Language Definition or in R Internals about these special meanings, and, e.g., methods("|") gives me methods for hex and octal. The best discussion I have seen is still Statistical Models in S, Chap. 2, sect. 2.3.1, but I believe this is incomplete and maybe also not current.
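Part of the reason `methods("|")` is unhelpful here is that `~` does not evaluate its arguments: a formula is stored as an unevaluated call, so operators inside it are never dispatched when the formula is created. Whatever meaning `|` has is assigned later by the code that consumes the formula. A small sketch, using placeholder variable names:

```r
f <- y ~ a | b

class(f)                 # "formula"
f[[1]]                   # the symbol `~`
rhs <- f[[3]]            # right-hand side, itself an unevaluated call
class(rhs)               # "call"
as.character(rhs[[1]])   # "|"
all.vars(f)              # "y" "a" "b"
```

None of `y`, `a`, or `b` has to exist for this to run, which is consistent with nothing being evaluated.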

andrewH
  • I think this question might get closed, but I'm interested in this. If it was possible to get a bit more insight into R's formula parsing, it would be a lot easier to work out how to e.g. label terms and their interactions when displaying results. There are packages out there that seem to do this by basically applying regular expressions to the formula string, which seems less than ideal. – Marius Mar 03 '20 at 03:47
  • I think [this](https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/src/model.c#L1650) is where the parsing happens for the `terms.formula` function, which is used in the `lm` function. – Bas Mar 03 '20 at 14:37
  • @Bas I think you are exactly right. This is not going to help me much I think -- it is written in a mixture of C and LISP (how do they even do that?), neither of which languages I know at all well, and a lot of the meat of it is handled with pointer-oriented metaprogramming, which is beyond my competence. However, my actual question was "How can I find the source?", which you did. So if you could recount how you did it, that would be an answer. – andrewH Mar 04 '20 at 20:47
  • @Marius I think you are right that this is closable as a question about the location of a piece of code. However, as a meta-question about how I can find pieces of code _like_ this, I think it is valid. With both errors and source code, what you are looking for is often pretty far down the call stack from anything you wrote, and I have often found it difficult to figure out what function contains the relevant code, what it is doing, and what arguments it is using. – andrewH Mar 04 '20 at 21:13
  • Case in point: scanning through terms.formula looking for "I()", "*", and so forth, I missed that all the real meat of the function is in the one line "terms <- .External(C_termsform, x, specials, data, keep.order, allowDotAsName)", which I really should have caught. Then I would at least have known the next step was to remember or track down how to find the lookup tables for C code. – andrewH Mar 04 '20 at 21:25
  • A possible source of confusion or barrier to understanding is assuming that the regression functions' particular use of formulas is a process that is embedded in the functions you are considering `formula`-specific. The regression functions such as `lm` use `model.matrix` and `model.frame` to add meaning to the parsed formula components. – IRTFM Mar 31 '20 at 17:54
  • This question is already considered by some close-worthy. I agree, but rather on the basis of being too broad. There is no specific coding question here and it appears to be asking for a response (or a reference to one) that could take up a book chapter. I think referral to Hadley's Advanced R series might be useful, but this does need to be closed as it stands. – IRTFM Mar 31 '20 at 18:22
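Two points from the comments above can be tried directly in R. The sketch below first searches the R-level source of the `terms` method for formulas for the hand-off to compiled code (the `C_termsform` call quoted in the comments; whether that exact name appears may depend on the R version), and then follows the division of labour IRTFM describes, where `model.frame` and `model.matrix` turn the parsed terms into columns. The data frame `d` is made up for illustration.

```r
# 1. Find the line(s) in the R-level method that call into C.
src <- deparse(getS3method("terms", "formula"))
grep("C_termsform", src, value = TRUE)

# 2. From terms to model frame to model matrix, on toy data.
d <- data.frame(
  y = c(1, 2, 3, 4),
  a = factor(c("u", "u", "v", "v")),
  x = c(0.5, 1, 1.5, 2)
)
tt <- terms(y ~ a * x, data = d)   # parse and expand the formula
mf <- model.frame(tt, data = d)    # evaluate the variables in the data
mm <- model.matrix(tt, mf)         # code factors and build the columns

colnames(mm)         # "(Intercept)" "av" "x" "av:x"
attr(mm, "assign")   # 0 1 2 3: which term each column belongs to
```

The `assign` attribute maps model-matrix columns back to terms, which may be a sturdier basis for labelling terms and interactions than applying regular expressions to the formula string.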

0 Answers