Please consider the body of read.table as a text file, created with the following code:

sink("readTable.txt")
body(read.table)
sink()

Using regular expressions, I'd like to find all function calls of the form foo(a, b, c) (but with any number of arguments) in "readTable.txt". That is, I'd like the result to contain the names of all called functions in the body of read.table. This includes nested functions of the form foo(a, bar(b, c)). Reserved words (return, for, etc.) and functions that use backticks (`==`(), `+`(), etc.) can be included since I can remove them later.

So in general, I'm looking for the pattern text( or text ( then possible nested functions like text1(text2(, but skipping over the text if it's an argument, and not a function. Here's where I'm at so far. It's close, but not quite there.

x <- readLines("readTable.txt")
regx <- "^(([[:print:]]*)\\(+.*\\))"
mat <- regexpr(regx, x)
lines <- regmatches(x, mat)
fns <- gsub(".*( |(=|(<-)))", "", lines)
head(fns, 10)
# [1] "default.stringsAsFactors()" "!missing(text))"
# [3] "\"UTF-8\")" "on.exit(close(file))" "(is.character(file))"
# [6] "(nzchar(fileEncoding))" "fileEncoding)" "\"rt\")"
# [9] "on.exit(close(file))" "\"connection\"))"

For example, in [9] above, the calls are there, but I do not want file in the result. Ideally it would be on.exit(close(.

How can I go about improving this regular expression?

Rich Scriven
  • If it doesn't have to be a regex solution, consider `codetools::walkCode`. http://stackoverflow.com/questions/11872879/finding-out-which-functions-are-called-within-a-given-function/11878961#11878961 – GSee Jun 30 '14 at 01:33
  • @GSee, now that is interesting, and useful. For this problem, I'd like to stick with the regex method. This was a problem I had for an undergrad class and I'm just trying to improve upon my previous answer. Feel free to post that function as an answer though, it's pretty cool. – Rich Scriven Jun 30 '14 at 01:45
  • Can you provide sample input data you're searching for as a match? – hwnd Jun 30 '14 at 01:50
  • Check out the mvbutils package. – G. Grothendieck Jun 30 '14 at 02:23
  • `I'd like to stick with the regex method` You can play around by adding strings to test in [this regex demo](http://regex101.com/r/oF6tD3/1) from my answer. Don't let the different colors faze you. The whole string is matched, but the color change at mid-stream indicates where Group 1 is matched (which we don't need to concern ourselves with.) – zx81 Jun 30 '14 at 03:08

2 Answers

If you've ever tried to parse HTML with a regular expression, you know what a nightmare it can be. It's always better to use an HTML parser and extract the information that way. I feel the same way about R code. The beauty of R is that it's functional and you can inspect any function via code.

Something like

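# Operators and indexing functions to leave out of the result
call.ignore <- c("[[", "[", "&", "&&", "|", "||", "==", "!=",
    "-", "+", "*", "/", "!", ">", "<", ":")

find.funcs <- function(f, descend=FALSE) {
    if (is.function(f)) {
        # Given a function object, search its body
        return(find.funcs(body(f), descend=descend))
    } else if (is(f, "name") || is.atomic(f)) {
        # Symbols and constants contain no calls
        return(character(0))
    }
    v <- list()
    # Record the call unless its function is in call.ignore
    if (is(f, "call") && !(deparse(f[[1]]) %in% call.ignore)) {
        v[[1]] <- deparse(f)
        # Without descend, stop at the outermost call
        if (!descend) return(v[[1]])
    }
    # Recurse into every component of the expression
    v <- append(v, lapply(as.list(f), find.funcs, descend=descend))
    unname(do.call(c, v))
}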

could work. Here we iterate over each object in the function looking for calls, ignoring those you don't care about. You would run it on a function like

find.funcs(read.table)

# [1] "default.stringsAsFactors()"                
# [2] "missing(file)"                             
# [3] "missing(text)"                             
# [4] "textConnection(text, encoding = \"UTF-8\")"
# [5] "on.exit(close(file))"                      
# [6] "is.character(file)"  
# ...

You can set the descend= parameter to TRUE if you also want to search the arguments of each call for further calls.

I'm sure there are plenty of packages that make this easier, but I just wanted to show how simple it really is.
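
If only the function names are needed rather than the deparsed calls, the parser's own token table is another option. The following is a minimal sketch, not part of the approach above: it relies on utils::getParseData(), which labels each function-call symbol with the token SYMBOL_FUNCTION_CALL. The helper name called.functions and the sample string code are made up for illustration.

```r
# Hypothetical helper: list the names of functions called in a string of R code
called.functions <- function(code) {
    # keep.source = TRUE is needed so that the parse data are retained
    exprs <- parse(text = code, keep.source = TRUE)
    pd <- utils::getParseData(exprs)
    # Keep only the symbols the parser tagged as function calls
    unique(pd$text[pd$token == "SYMBOL_FUNCTION_CALL"])
}

# Illustrative input, not taken from read.table
code <- "foo(a, bar(b, c))\nif (x > 1) baz(x)"
called.functions(code)
```

Reserved words such as if and for carry their own token types, so they are not reported. The same helper can be given the lines of a file read with readLines(), provided the text parses as valid R.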

MrFlick
  • Can you build this out as a tree or something? I can ask another question if need be. If the function is multiline, it would be nice to store in a list. The $ sign is maybe also an issue: `na.omit(github.df$links)`. I want to build a function called `traceforward` that tracks all unique functions, possibly sorted by "library" ... using `search` – mshaffer Apr 23 '21 at 01:24
Recursive Regex in Perl Mode

In the general case, I am sure you're aware of the hazards of trying to match such constructions: what if your file contains things like if() that you don't want to match?

That being said, I believe this recursive regex fits the requirements as I understand them:

[a-z]+(\((?:`[()]|[^()]|(?1))*\))

See demo.

I'm not completely up to scratch on R syntax, but something like this should work, and you can tweak the function name and arguments to suit your needs:

grepl("[a-z]+(\\((?:`[()]|[^()]|(?1))*\\))", subject, perl=TRUE);

Explanation

  • [a-z]+ matches the letters before the opening parenthesis
  • ( starts Group 1
  • \( matches an opening parenthesis
  • (?: starts a non-capture group that will be repeated. The non-capture group matches several possibilities:
  • `[()] matches a backtick followed by ( or )
  • |[^()] OR match one character that is not a parenthesis
  • |(?1) OR match the pattern defined by the Group 1 parentheses (recurse)
  • )* close non-capture group, repeat zero or more times
  • \) matches a closing parenthesis
  • ) ends Group 1
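
To see what the pattern returns in R, here is a minimal sketch on a few made-up input lines. The first part uses the recursive pattern exactly as written above. The second part is an assumed alternative, not the recursive regex: a lookahead that picks out every name followed by an opening parenthesis, so nested calls are reported too. The variable names and the name.rx pattern are illustrative.

```r
# The recursive pattern from above, escaped for an R string
pattern <- "[a-z]+(\\((?:`[()]|[^()]|(?1))*\\))"

# Illustrative input lines (not the full body of read.table)
subject <- c("on.exit(close(file))", "x <- 1", "foo(a, bar(b, c)) + 2")

# First balanced call on each line; lines without a match are dropped
calls <- regmatches(subject, regexpr(pattern, subject, perl = TRUE))

# Assumed alternative: names followed by "(", found with a lookahead,
# so that nested calls are reported as well
name.rx <- "[A-Za-z.][A-Za-z0-9._]*(?=\\s*\\()"
fn.names <- unlist(regmatches(subject, gregexpr(name.rx, subject, perl = TRUE)))

print(calls)     # "exit(close(file))" "foo(a, bar(b, c))"
print(fn.names)  # "on.exit" "close" "foo" "bar"
```

Because [a-z]+ excludes the dot, the first match starts at exit rather than on.exit; widening the name class would address that. The lookahead pattern also reports reserved words such as if and for, and it does not skip names inside string literals, so some filtering afterwards is still needed.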
zx81
  • |(?1) does this cover if the argument is another function call? trying this in python using the 're' module. complains about ?1 – roocell Jan 12 '21 at 23:01