I would like to extract comments (matching given patterns) from my R source script, grouped by the functions in which they occur.

The goal is to write documentation comments inside the function body using classic markdown checkboxes, - [ ] or - [x], and extract those comments for further processing as a list of character vectors, which I can easily write to a new .md file.

Reproducible example and expected output below.

# preview the 'data'
script_body = c('# some init comment - not matching pattern','g = function(){','# - [x] comment_g1','# - [ ] comment_g2','1','}','f = function(){','# - [ ] comment_f1','# another non match to pattern','g()+1','}')
cat(script_body, sep = "\n")
# # some init comment - not matching pattern
# g = function(){
#     # - [x] comment_g1
#     # - [ ] comment_g2
#     1
# }
# f = function(){
#     # - [ ] comment_f1
#     # another non match to pattern
#     g()+1
# }

# populate R source file
writeLines(script_body, "test.R")

# test it 
source("test.R")
f()
# [1] 2

# expected output
r = magic_function_get_comments("test.R", starts.with = c(" - [x] "," - [ ] "))
# r = list("g" = c(" - [x] comment_g1"," - [ ] comment_g2"), "f" = " - [ ] comment_f1")
str(r)
# List of 2
#  $ g: chr [1:2] " - [x] comment_g1" " - [ ] comment_g2"
#  $ f: chr " - [ ] comment_f1"
hrbrmstr
jangorecki

2 Answers

Here’s a stripped-down, unevaluated variant of what hrbrmstr has done:

get_comments = function (filename) {
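    # TRUE if `expr` is the name of an assignment operator or of `assign`
    is_assign = function (expr)
        is.name(expr) && as.character(expr) %in% c('<-', '<<-', '=', 'assign')

    # TRUE for expressions that assign a `function` expression to a name
    is_function = function (expr)
        is.call(expr) && is_assign(expr[[1]]) && is.call(expr[[3]]) && identical(expr[[3]][[1]], quote(`function`))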

    source = parse(filename, keep.source = TRUE)
    functions = Filter(is_function, source)
    fun_names = as.character(lapply(functions, `[[`, 2))
    setNames(lapply(attr(functions, 'srcref'), grep,
                    pattern = '^\\s*#', value = TRUE), fun_names)
}

This comes with a caveat: since we don’t evaluate the source, we may miss function definitions (for instance, we wouldn’t find f = local(function (x) x)). The above function uses a simple heuristic to find function definitions (it looks at all simple assignments of a function expression to a variable).

This can only be fixed using eval (or source), which comes with its own caveats — for instance, it’s a security risk to execute files from an unknown source.
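
To make the caveat concrete, here is a minimal sketch that parses two definitions from a string without evaluating them and applies the same kind of heuristic. The helper name `is_plain_function_def` and the variable names are made up for this illustration; only the `local(...)` example comes from the text above.

```r
# Two definitions: a plain assignment and one wrapped in local()
exprs = parse(text = c('g = function () 1',
                       'f = local(function (x) x)'),
              keep.source = TRUE)

# Hypothetical helper mirroring the heuristic: an assignment whose
# right-hand side is directly a `function` expression
is_plain_function_def = function (expr)
    is.call(expr) &&
        is.name(expr[[1]]) &&
        as.character(expr[[1]]) %in% c('<-', '<<-', '=') &&
        is.call(expr[[3]]) &&
        identical(expr[[3]][[1]], quote(`function`))

found = vapply(exprs, is_plain_function_def, logical(1))
# found is TRUE for g and FALSE for f
```

The right-hand side of `f` is a call to `local`, not to `function`, so a purely syntactic check cannot tell that it yields a function without evaluating it.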

Konrad Rudolph
  • jangorecki - you shld prbly accept this as the answer so i can delete mine (Konrad's is a much better solution) – hrbrmstr Sep 18 '15 at 13:08
  • @hrbrmstr Respectfully disagree. – Konrad Rudolph Sep 18 '15 at 13:10
  • @hrbrmstr I will accept this one but don't delete yours, it answers the question, it is good, it teach people, I see it valuable too! – jangorecki Sep 18 '15 at 13:10
  • _"it's a security risk…"_ given what `data.table` did with tracking code in their CRAN package vignettes, (`source`, `library`, etc al) is a security risk in general IMO. – hrbrmstr Sep 18 '15 at 13:11
  • @hrbrmstr Ouch, didn’t know about this. Sounds horrible. – Konrad Rudolph Sep 18 '15 at 13:13
  • Thanks both. I've made a wrapper for easier use on package development. Described on blog [here](http://jangorecki.github.io/blog/2015-09-18/Function-body-documentation.html) – jangorecki Sep 18 '15 at 19:10
  • Also live example: [logR source code description](https://github.com/jangorecki/logR/blob/804b6177e70043c92c83dc8840de391d9a107b76/inst/doc/doc.md). It seems to be nice to fill the gap between the source code and project functional requirements. If anybody is able to boost the code by matching number on whitespaces to produce nested lists in markdown feel free to contribute :) – jangorecki Oct 26 '15 at 13:54

It's unlikely anyone is going to write the grep / stringr::str_match part for you (this isn't a grunt code-writing service). But, the idiom for iterating over parsed function source might be useful enough to a broader audience to warrant inclusion.

CAVEAT This source()s the .R file, meaning it evaluates it.

#' Extract whole comment lines from an R source file
#'
#' \code{source()} an R source file into a temporary environment then
#' iterate over the source of \code{function}s in that environment and
#' \code{grep} out the whole line comments which can then be further
#' processed.
#' 
#' @param source_file path name to source file that \code{source()} will accept
extract_comments <- function(source_file) {

  tmp_env <- new.env(parent=sys.frame())
  source(source_file, tmp_env, echo=FALSE, print.eval=FALSE, verbose=FALSE, keep.source=TRUE)
  funs <- Filter(is.function, sapply(ls(tmp_env), get, tmp_env))

  lapply(funs, function(f) {
    # get function source
    function_source <- capture.output(f)
    # only get whole line comment lines
    comments <- grep("^[[:blank:]]*#", function_source, value=TRUE)
    # INCANT YOUR GREP/REGEX MAGIC HERE 
    # instead of just returning the comments
    # since this isn't a free code-writing service
    comments
  })

}

str(extract_comments("test.R"))
## List of 2
##  $ f: chr [1:2] "# - [ ] comment_f1" "# another non match to pattern"
##  $ g: chr [1:2] "# - [x] comment_g1" "# - [ ] comment_g2"
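
For readers who want the remaining step, here is one possible sketch of the filtering. It is not part of the answer above: `filter_comments` is a hypothetical helper, and the `comments` list simply reproduces the output shown. It assumes R >= 3.3.0 for `startsWith()`.

```r
# Hypothetical helper: keep only the comments that start with one of the
# given prefixes once leading blanks and the "#" marker are removed
filter_comments <- function(comments, starts.with = c(" - [x] ", " - [ ] ")) {
  # strip leading blanks and the comment marker(s)
  stripped <- sub("^[[:blank:]]*#+", "", comments)
  # fixed-string prefix test, so "[" and "]" need no regex escaping
  keep <- vapply(stripped, function(s) any(startsWith(s, starts.with)), logical(1))
  unname(stripped[keep])
}

# same structure as the result of extract_comments("test.R") shown above
comments <- list(
  f = c("# - [ ] comment_f1", "# another non match to pattern"),
  g = c("# - [x] comment_g1", "# - [ ] comment_g2")
)

r <- lapply(comments, filter_comments)
str(r)
```

Matching on fixed prefixes rather than a regular expression keeps the `starts.with` argument from the question usable as-is.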
hrbrmstr
  • “I do this since source() preserves comments whereas parse does not.” — Nah, that’s not true, they behave the same. In fact, `source` uses `parse` internally. – Konrad Rudolph Sep 18 '15 at 12:52
  • My attempt at brevity was probably unwarranted (I know it keeps it but getting to the comments via parse is a pain). I'll re-phrase. – hrbrmstr Sep 18 '15 at 12:55
  • Evaluation is not a problem so your solution is perfect, I will work on regex myself - agree with your comment on that. Thanks! – jangorecki Sep 18 '15 at 12:56
  • @hrbrmstr I don’t find expressions more complex to deal with than evaluated functions (see my answer) but the answer is otherwise good. One huge caveat: your `source` is non-locally, so it will dump stuff into the global namespace. Make it evaluate in an empty environment instead. [Also, `source` has a bug under Windows](http://stackoverflow.com/q/24454559/1968) in that it cannot deal with UTF-8. The only workaround I know is to use `eval(parse(…))` instead of `source(…)`. – Konrad Rudolph Sep 18 '15 at 13:10
  • gd point. commentary removed and environment pollution localized. thx – hrbrmstr Sep 18 '15 at 13:21