Studying the source code of primitive and internal R functions: How is R connected with C?

Question

Ben Bolkers' answer to this question and the article by Uwe Ligges are already very useful when I try to "decode" a primitive or internal R function. But how is a primitive R function connected with its corresponding C function? I guess that somehow .Primitive must provide this missing link. Take for example is.na:

> is.na
function (x)  .Primitive("is.na")

FUNTAB R_FunTab[] in file "names.c" contains

{"is.na",   do_isna,    0,  1,  1,  {PP_FUNCALL, PREC_FN,   0}},

which means that is.na uses the C function do_isna. do_isna is defined in file "coerce.c":

SEXP attribute_hidden do_isna(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP ans, dims, names, x;
    R_xlen_t i, n;

    checkArity(op, args);
    check1arg(args, call, "x");

    if (DispatchOrEval(call, op, "is.na", args, rho, &ans, 1, 1))
    return(ans);
    PROTECT(args = ans);
#ifdef stringent_is
    if (!isList(CAR(args)) && !isVector(CAR(args)))
    errorcall_return(call, "is.na " R_MSG_list_vec);

#endif
    x = CAR(args);
    n = xlength(x);
    PROTECT(ans = allocVector(LGLSXP, n));
    if (isVector(x)) {
    PROTECT(dims = getAttrib(x, R_DimSymbol));
    if (isArray(x))
        PROTECT(names = getAttrib(x, R_DimNamesSymbol));
    else
        PROTECT(names = getAttrib(x, R_NamesSymbol));
    }
    else dims = names = R_NilValue;
    switch (TYPEOF(x)) {
    case LGLSXP:
       for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (LOGICAL(x)[i] == NA_LOGICAL);
    break;
    case INTSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (INTEGER(x)[i] == NA_INTEGER);
    break;
    case REALSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
    break;
    case CPLXSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (ISNAN(COMPLEX(x)[i].r) ||
                   ISNAN(COMPLEX(x)[i].i));
    break;
    case STRSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (STRING_ELT(x, i) == NA_STRING);
    break;

/* Same code for LISTSXP and VECSXP : */
#define LIST_VEC_NA(s)                          \
    if (!isVector(s) || length(s) != 1)             \
        LOGICAL(ans)[i] = 0;                    \
    else {                              \
        switch (TYPEOF(s)) {                    \
        case LGLSXP:                        \
        case INTSXP:                        \
            LOGICAL(ans)[i] = (INTEGER(s)[0] == NA_INTEGER);    \
            break;                      \
        case REALSXP:                       \
            LOGICAL(ans)[i] = ISNAN(REAL(s)[0]);        \
            break;                      \
        case STRSXP:                        \
            LOGICAL(ans)[i] = (STRING_ELT(s, 0) == NA_STRING);  \
            break;                      \
        case CPLXSXP:                       \
            LOGICAL(ans)[i] = (ISNAN(COMPLEX(s)[0].r)       \
                       || ISNAN(COMPLEX(s)[0].i));  \
            break;                      \
        default:                        \
            LOGICAL(ans)[i] = 0;                \
        }                           \
    }

    case LISTSXP:
    for (i = 0; i < n; i++) {
        LIST_VEC_NA(CAR(x));
        x = CDR(x);
    }
    break;
    case VECSXP:
    for (i = 0; i < n; i++) {
        SEXP s = VECTOR_ELT(x, i);
        LIST_VEC_NA(s);
    }
    break;
    case RAWSXP:
    /* no such thing as a raw NA */
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = 0;
    break;
    default:
    warningcall(call, _("%s() applied to non-(list or vector) of type '%s'"),
            "is.na", type2char(TYPEOF(x)));
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = 0;
    }
    if (dims != R_NilValue)
    setAttrib(ans, R_DimSymbol, dims);
    if (names != R_NilValue) {
    if (isArray(x))
        setAttrib(ans, R_DimNamesSymbol, names);
    else
        setAttrib(ans, R_NamesSymbol, names);
    }
    if (isVector(x))
    UNPROTECT(2);
    UNPROTECT(1);
    UNPROTECT(1); /*ans*/
    return ans;
}

But if we want to evaluate is.na(x=3) for example, how are the arguments call,op,args,rho generated? At least some external information must be used, x=3 is not enough. Moreover, at first glance x=3 is not used at all, which must be wrong of course:

> is.na
function (x)  .Primitive("is.na")

The R Code of .Primitive doesn't give a hint:

> .Primitive
function (name)  .Primitive(".Primitive")

Taking all this into account, it is not surprising that an apparently excellent copy isNA of is.na fails:

> isNA <- function (x)  .Primitive("is.na")
> isNA
function (x)  .Primitive("is.na")
> is.na
function (x)  .Primitive("is.na")
> isNA(x=3)
function (x)  .Primitive("is.na")
> is.na(x=3)
[1] FALSE

To put it straight: All of the C functions do_... have these arguments call,op,args,rho. By what formula are they calculated when a primitive R function is called?

`.Primitive` is a function that returns the matched primitive function; the value returned from typing `.Primitive("is.na")` is a "builtin" function. So, by the way you've defined `isNA`, it is a "closure" function that calls `.Primitive` which returns a builtin function: `isNA = function(x) .Primitive("is.na"); typeof(isNA); typeof(isNA(3)); isNA()(3)`. `isNA` will be like `is.na` if defined as `isNA = .Primitive("is.na"); isNA` — alexis_laz, Mar 25 '16 at 17:12

Martin Morgan · Accepted Answer · 2016-03-25T14:49:38.340

Great question. I started R under gdb R -d gdb, set a breakpoint at do_isna, then continued R and entered is.na(3).

$ R -d gdb
(gdb) run
Starting program: /home/mtmorgan/bin/R-3-3-branch/bin/exec/R --no-save --no-restore --silent
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> ## break, cntrl-C
Program received signal SIGINT, Interrupt.
0x00007ffff722fd83 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) b do_isna
Breakpoint 1 at 0x7ffff77e0b3b: file /home/mtmorgan/src/R-3-3-branch/src/main/coerce.c, line 1982.
(gdb) continue
Continuing.

> is.na(3)

Breakpoint 1, do_isna (call=0x1838888, op=0x628218, args=0x1838770, rho=0x63f648)
    at /home/mtmorgan/src/R-3-3-branch/src/main/coerce.c:1982
1982        checkArity(op, args);
(gdb)

At the gdb prompt I asked

(gdb) where
#0  do_isna (call=0x1838888, op=0x628218, args=0x1838770, rho=0x63f648) at /home/mtmorgan/src/R-3-3-branch/src/main/coerce.c:1982
#1  0x00007ffff7869170 in Rf_eval (e=0x1838888, rho=0x63f648) at /home/mtmorgan/src/R-3-3-branch/src/main/eval.c:717
#2  0x00007ffff78b36af in Rf_ReplIteration (rho=0x63f648, savestack=0, browselevel=0, state=0x7fffffffcaf0) at /home/mtmorgan/src/R-3-3-branch/src/main/main.c:258
...

Starting from #2, Rf_ReplIteration is the REPL (read-eval-print loop) trying to evalute is.na(3). It's provided with the environment where the function is being called from. By the time it calls Rf_eval() on line 258, it knows the environment and the call

(gdb) call Rf_PrintValue(rho)
<environment: R_GlobalEnv>
(gdb) call Rf_PrintValue(thisExpr)
is.na(3)

By #1 (eval.c:717), R has figured out the values of op and tmp.

(gdb) call Rf_PrintValue(op)
function (x)  .Primitive("is.na")
(gdb) call TYPEOF(op)
$2 = 8

(type 8 is 'BUILTINSXP', from the table in Rinternals.h). It does this by finding out that e is a LANGSXP (line 614), that is.na is a SYMSXP (line 670), and that the function it references (op) is a BUILTINSXP (line 700). It then uses (line 717)

(gdb) call PRIMFUN(op)
$8 = (SEXP (*)(SEXP, SEXP, SEXP, SEXP)) 0x7ffff77e0b20 <do_isna>

to discover that it should invoke do_isna with the values it's discovered.

Hopefully that removes some of the mystery, and points to relevant parts of the code.

Studying the source code of primitive and internal R functions: How is R connected with C?

1 Answers1