
I work for an organization with a number of internal R packages, all written many years ago. These are stored as .zip archives built on Windows under R 3.x. They cannot be loaded on Linux or macOS or under R 4.y without being rebuilt. Unfortunately, I do not have access to the package sources. They are lost to time...

I want to take these binaries, extract the source code, and repackage it according to current best practices (version control, roxygen2, testthat, etc.). What is the best way to do that?

I have already tackled one of the binaries by:

  1. Manually copying the source code of objects (exported and internal functions, data sets, etc.) in the loaded namespace to new .R files.
  2. Manually adding Roxygen blocks to the .R files in order to reproduce the help pages as displayed in the browser.

I am partly stuck at (1) because some of the functions are S4 generics: dput(<name>) gives new("standardGeneric", ...) rather than a plain function definition. Otherwise, the process has been fairly straightforward, but very time-consuming.

Is there a way to programmatically "back engineer" source files from R package binaries, while handling S4 generic functions, classes, and methods correctly?

Everyone in the organization will be stuck on R 3.6 until this problem is resolved.

Mikael Jagan
hokeybot
  • I doubt I have sufficient expertise for this, but shouldn't it be possible to iterate over the namespace without copying things by hand? – Greg Nov 11 '21 at 15:24
  • The R Internals manual (https://cran.r-project.org/doc/manuals/r-patched/R-ints.pdf), pages 21-22, discusses how the R source files are converted for the compiled package. I guess that if the package you are trying to rebuild contains compiled C/C++ code, there won't be much you can do. – nicola Nov 11 '21 at 16:18
  • There's not a one-size-fits-all way to do this. Without any sort of [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) it's really impossible to offer specific suggestions. Every package is different and without knowing exactly what the requirements are or how to measure "success", this really can't be answered. – MrFlick Nov 11 '21 at 18:58
  • What might `dput()` do for these custom objects and for their methods? This answer [here](https://stackoverflow.com/a/3474049) should handle the complexities of reproducing nested objects, if `dput()` falls short. – Greg Nov 16 '21 at 20:18
  • @hokeybot Just bountied this question. I've wrestled with some reverse-engineering myself, and I'm curious as to the answer (if any) for the `standardGeneric`s and custom objects. **I suggest you post a snippet of SOURCE CODE for one of those items, so any "bounty hunters" have a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to work with**. – Greg Nov 23 '21 at 22:29

1 Answer


Check whether this works in R 3.6.

The script below can automate at least part of your problem by writing the source of each function to a separate, appropriately named .R file. It also picks up hidden functions, i.e. those whose names begin with a dot.

Extracting code

# Use your package name
package_name <- "dplyr"

# Attach the package so that "package:<name>" is on the search path
library(package_name, character.only = TRUE)

# Extract all function names, including hidden (dot-prefixed) ones
nms <- paste(lsf.str(paste0("package:", package_name), all.names = TRUE))

# Loop through the function names,
# extract head and body, and write them to R files
# (iterating over nms directly is safe even if nms is empty)
for (nm in nms) {

    # Extract head
    hd_raw <- capture.output(args(nm))
    # Collapse raw output, but drop trailing NULL
    hd <- paste0(hd_raw[-length(hd_raw)], collapse = "\n")

    # Extract body, collapse
    bd <- paste0(capture.output(body(nm)), collapse = "\n")

    # Write all to file
    write(paste0(hd, bd), file = paste0(nm, ".R"))
}
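
The loop above only sees what is attached under "package:<name>", which means exported functions. A minimal sketch of the namespace walk suggested in the comments follows, using dput() so that internal (non-exported) functions are captured too. The helper name extract_functions is hypothetical, stats4 is used only because it ships with R, and S4 generics are skipped because dput() does not give usable source for them.

```r
# Hypothetical helper: deparse every ordinary function in a package namespace.
# Returns a named character vector of "`name` <- function(...) ..." strings.
extract_functions <- function(package_name) {
    ns <- asNamespace(package_name)
    out <- character(0)
    # ls() on the namespace lists internal as well as exported objects
    for (nm in ls(ns, all.names = TRUE)) {
        obj <- get(nm, envir = ns, inherits = FALSE)
        # Skip data, environments, and S4 generics
        if (!is.function(obj) || methods::is(obj, "genericFunction")) next
        src <- paste0(capture.output(dput(obj)), collapse = "\n")
        # Backticks keep non-syntactic names (e.g. operators) sourceable
        out[nm] <- paste0("`", nm, "` <- ", src)
    }
    out
}

# Usage: write all functions of one package to a single file
src <- extract_functions("stats4")
writeLines(src, file.path(tempdir(), "stats4-functions.R"))
```

Writing to a single file sidesteps function names that are not valid file names (operators such as `%>%`, for instance); splitting into one file per function can be done afterwards by hand.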

Extracting help files

To extract a function's help text in a similar way, you can adapt code from related SO answers. A starting point could be something like:

library(tools)
package_name <- "dplyr"
library(package_name, character.only = TRUE)
db <- Rd_db(package_name)

# Extract all function names, including hidden (dot-prefixed) ones
nms <- paste(lsf.str(paste0("package:", package_name), all.names = TRUE))

# Loop through the function names,
# extract Rd contents if they exist in this package,
# and write them to new Rd files
for (nm in nms) {

    # Match the file name case-insensitively (".Rd" vs. ".rd")
    idx <- which(tolower(names(db)) == tolower(paste0(nm, ".Rd")))
    if (length(idx) > 0) {
        # Deparse the parsed Rd object back to Rd source text
        rd <- paste(as.character(db[[idx[1]]], deparse = TRUE), collapse = "")
        # Write all to file
        write(rd, file = paste0(nm, ".Rd"))
    }

}
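
Extracting S4 generics and methods

For the S4 part of the question, a rough sketch under some assumptions: generics defined by the package are stored as objects in its namespace, and their methods can be found with findMethods(). The helper names deparse_fun and extract_s4 are hypothetical, and stats4 is used only because it ships with R. The sketch rebuilds setGeneric() and setMethod() calls as text. It does not cover class definitions (getClasses() and getClassDef() would be the place to start), valueClass, or methods for generics owned by other packages, so treat the output as a draft to review by hand.

```r
library(methods)

# Hypothetical helper: collapse a deparsed function into one string
deparse_fun <- function(f) paste(deparse(f), collapse = "\n")

# Hypothetical helper: rebuild setGeneric()/setMethod() calls for the S4
# generics stored as objects in a package namespace
extract_s4 <- function(package_name) {
    ns <- asNamespace(package_name)
    out <- character(0)
    for (nm in ls(ns, all.names = TRUE)) {
        obj <- get(nm, envir = ns, inherits = FALSE)
        if (!is(obj, "genericFunction")) next

        # The plain function behind the generic lives in its .Data slot
        out <- c(out, paste0("setGeneric(\"", nm, "\", ",
                             deparse_fun(obj@.Data), ")"))

        # Methods registered for this generic in the package's own table
        meths <- findMethods(nm, where = ns)
        for (m in meths@.Data) {
            # Skip defaults derived from a pre-existing ordinary function
            if (!is(m, "MethodDefinition") || is(m, "derivedDefaultMethod")) next
            sig <- m@defined
            sig_txt <- paste0(sig@names, " = \"", sig@.Data, "\"",
                              collapse = ", ")
            out <- c(out, paste0("setMethod(\"", nm, "\", signature(",
                                 sig_txt, "), ", deparse_fun(m@.Data), ")"))
        }
    }
    out
}

# Usage: write the reconstructed calls to a file for manual review
s4_src <- extract_s4("stats4")
writeLines(s4_src, file.path(tempdir(), "stats4-s4.R"))
```

Methods whose formals differ from the generic's will appear with the `.local` wrapper that setMethod() inserted when the package was built; that wrapper can usually be unwound by hand.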
Roman
  • You can capture a function with a single line: `fn |> base::dput() |> utils::capture.output() |> base::paste0(collapse = "\n")` – Greg Nov 15 '21 at 14:45
  • Good point! Didn't think of using `dput` here. – Roman Nov 15 '21 at 15:22
  • Thank you, this is incredibly helpful! I only had to edit the help extraction code slightly (`".rd"` not `".Rd"`). Now I just need to figure out how to recreate the S4 custom generic... – hokeybot Nov 18 '21 at 08:58
  • @hokeybot To build off [@Roman](https://stackoverflow.com/users/9406040/roman)'s suggestion for help files, you might leverage the [**`Rd2roxygen`**](https://cran.r-project.org/web/packages/Rd2roxygen/vignettes/Rd2roxygen.html) package, to convert those `.rd` files back into [roxygen](https://roxygen2.r-lib.org/) comments. – Greg Dec 01 '21 at 21:53