I would like to find the named functions I use most frequently in my R scripts (ignoring operators such as "+", "$", and "["). Writing an elegant and reliable regex that matches function names has stumped me. Here is a small example and my clumsy code so far. I welcome cleaner, more reliable, and more comprehensive code.

test1 <- "colnames(x) <- subset(df, max(y))" 
test2 <- "sat <- as.factor(gsub('International', 'Int'l', sat))"
test3 <- "score <- ifelse(str_detect(as.character(sat), 'Eval'), 'Importance', 'Rating')"
test <- c(test1, test2, test3)

The test object includes eight functions (colnames, subset, max, as.factor, gsub, ifelse, str_detect, as.character), and the first two twice. My first attempt at matching them splits each string at every opening parenthesis:

(result <- unlist(strsplit(x = test, split = "\\(")))
 [1] "colnames"                               "x) <- subset"                          
 [3] "df, max"                                "y)"                                    
 [5] "sat <- as.factor"                       "gsub"                                  
 [7] "'International', 'Int'l', sat)))"       "score <- ifelse"                       
 [9] "str_detect"                             "as.character"                          
[11] "sat), 'Eval'), 'Importance', 'Rating')"

Then, a series of hand-crafted gsubs cleans the result from this particular test set, but such manual steps will undoubtedly fall short on other, less contrived strings (I offer one below).

(result <- gsub(" <- ", " ", gsub(".*\\)", "", gsub(".*,", "", perl = TRUE, result))))
 [1] "colnames"      " subset"       " max"          ""              "sat as.factor" "gsub"          ""             
 [8] "score ifelse"  "str_detect"    "as.character"

The object, test4, below includes the functions lapply, function, setdiff, unlist, sapply, and union. It also has indenting so there is internal spacing. I have included it so that readers can try a harder situation.

test4 <- "contig2 <- lapply(states, function(state) {
                             setdiff(unlist(sapply(contig[[state]], 
                                                   function(x) { contig[[x]]})), union(contig[[state]], state))"

(result <- unlist(strsplit(x = test4, split = "\\("))) 
(result <- gsub(" <- ", " ", gsub(".*\\)", "", gsub(".*,", "", perl = TRUE, result))))

BTW, a related SO question, "A better way to extract functions from an R script?", has to do with extracting entire functions to create a package.

EDIT after first answer

test.R <- c(test1, test2, test3) # I assume this was your first step, to create test.R
save(test.R,file = "test.R") # saved so that getParseData() could read it
library(dplyr)
tmp <- getParseData(parse("test.R", keep.source=TRUE))
tmp %>% filter(token=="SYMBOL") # token variable had only "SYMBOL" and "expr" so I shortened "SYMBOL_FUNCTION_CALL"
  line1 col1 line2 col2 id parent  token terminal text
1     1    1     1    4  1      3 SYMBOL     TRUE RDX2
2     2    1     2    1  6      8 SYMBOL     TRUE    X

Something happened with all the text. What should I have done?

lawyeR

2 Answers

Regexes might work, but you can use R itself to help you. I put your four lines into a file test.R, fixed the syntax problems & ran the following:

library(dplyr)

tmp <- getParseData(parse("test.R", keep.source=TRUE))

tmp %>% filter(token=="SYMBOL_FUNCTION_CALL")

##   line1 col1 line2 col2  id parent                token terminal         text
## 1      1    1     1    8   1      3 SYMBOL_FUNCTION_CALL     TRUE     colnames
## 2      1   16     1   21  11     13 SYMBOL_FUNCTION_CALL     TRUE       subset
## 3      1   27     1   29  19     21 SYMBOL_FUNCTION_CALL     TRUE          max
## 4      2    8     2   16  39     41 SYMBOL_FUNCTION_CALL     TRUE    as.factor
## 5      2   18     2   21  42     44 SYMBOL_FUNCTION_CALL     TRUE         gsub
## 6      3   10     3   15  72     74 SYMBOL_FUNCTION_CALL     TRUE       ifelse
## 7      3   17     3   26  75     77 SYMBOL_FUNCTION_CALL     TRUE   str_detect
## 8      3   28     3   39  78     80 SYMBOL_FUNCTION_CALL     TRUE as.character
## 9      5   12     5   17 119    121 SYMBOL_FUNCTION_CALL     TRUE       lapply
## 10     6    3     6    9 134    136 SYMBOL_FUNCTION_CALL     TRUE      setdiff
## 11     6   11     6   16 137    139 SYMBOL_FUNCTION_CALL     TRUE       unlist
## 12     6   18     6   23 140    142 SYMBOL_FUNCTION_CALL     TRUE       sapply
## 13     8   11     8   15 191    193 SYMBOL_FUNCTION_CALL     TRUE        union

As you can see, the text column has the names of the functions you called. This should work fine for all syntactically correct R files.

Note that it doesn't eval the code, just parses it.
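
Since the question asks which functions are used most often, here is a minimal base-R sketch that tallies the calls straight from a character vector, without dplyr or a file on disk. The names count_calls and code are made up for illustration, and the third line of the sample is altered so that max appears twice.

```r
# Sketch: tally function calls in a character vector of R code.
# `count_calls` and `code` are illustrative names, not part of the answer above.
count_calls <- function(code) {
  # keep.source = TRUE makes parse() retain the token table
  exprs <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(exprs)
  # keep only the tokens the parser marks as function calls
  calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]
  sort(table(calls), decreasing = TRUE)
}

code <- c(
  "colnames(x) <- subset(df, max(y))",
  "sat <- as.factor(gsub('International', 'Intl', sat))",
  "score <- ifelse(max(z) > 1, 'Importance', 'Rating')"
)
counts <- count_calls(code)
print(counts)
```

As with the file-based version, the code is only parsed, never evaluated, so it still has to be syntactically valid.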

EDIT test.R looks like this:

colnames(x) <- subset(df, max(y))
sat <- as.factor(gsub('International', 'Int\'l', sat))
score <- ifelse(str_detect(as.character(sat), 'Eval'), 'Importance', 'Rating')

contig2 <- lapply(states, function(state) {
  setdiff(unlist(sapply(contig[[state]],
                        function(x) { contig[[x]]})),
          union(contig[[state]], state))})
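
The RDX2 symbol in the question's edited output suggests that save() wrote an R data file rather than plain source text. A small sketch of one way to create the file instead, assuming a temporary path (any path would do) and a two-line sample:

```r
# Sketch: write the code strings as plain text so parse() can read them.
# save() serialises R objects into R's data format, which is not R source.
code <- c(
  "colnames(x) <- subset(df, max(y))",
  "sat <- as.factor(gsub('International', 'Int\\'l', sat))"
)
path <- tempfile(fileext = ".R")  # hypothetical location for the script
writeLines(code, path)            # one element per line, as plain text

pd <- utils::getParseData(parse(path, keep.source = TRUE))
called <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]
print(called)
```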
hrbrmstr
  • Awesome! I've never heard of getParseData(). I edited my question, however, to show the results, and welcome your further thoughts on where I must have gone wrong. – lawyeR Jan 03 '15 at 14:10
  • Excellent: This method finds out what objects actually are closures. Any regex-based method is doomed to failure because it only looks for possibly valid strings. -- I guess this comment applies equally to Gabor's answer as well :-) – Carl Witthoft Jan 03 '15 at 14:33
4

The code in the question does not have valid syntax but if we correct it:

test1 <- "colnames(x) <- subset(df, max(y))" 
test2 <- "sat <- as.factor(gsub('International', 'Intl', sat))"
test3 <- "score <- ifelse(str_detect(as.character(sat), 'Eval'), 'Importance', 'Rating')"
test <- c(test1, test2, test3)

then we can use findGlobals in the codetools package:

library(codetools)

f.text <- c("function(){", test, "}")
f <- eval(parse(text = f.text))
funs <- findGlobals(f, merge = FALSE)$functions

giving:

 > funs
 [1] "{"            "<-"           "as.character" "as.factor"    "colnames<-"  
 [6] "gsub"         "ifelse"       "max"          "str_detect"   "subset"    

It's not clear which functions you wish to exclude, but if F is a character vector containing them then setdiff(funs, F) will give all but those.
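
For instance, a rough sketch of that filtering step. The vector name exclude and the choice of what to drop are assumptions for illustration, and stripping the trailing "<-" from replacement functions such as "colnames<-" is an extra step you may or may not want:

```r
library(codetools)

code <- c(
  "colnames(x) <- subset(df, max(y))",
  "sat <- as.factor(gsub('International', 'Intl', sat))"
)
# wrap the lines in a function body so findGlobals() can analyse them
f <- eval(parse(text = c("function() {", code, "}")))
funs <- findGlobals(f, merge = FALSE)$functions

# `exclude` is an illustrative name; extend it with any operators to ignore
exclude <- c("{", "<-")
named <- setdiff(funs, exclude)
# report replacement functions such as "colnames<-" under their plain name
named <- unique(sub("<-$", "", named))
print(named)
```

Note that findGlobals returns each name once, so this gives the set of functions used rather than how often each is called.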

Also see "Finding out which functions are called within a given function" and "Generating a Call Graph in R".

G. Grothendieck
  • Thank you for the code corrections. You also took out the apostrophe from "Int'l". Having tried the codetools technique on my test4, I realize that the code has to parse correctly for findGlobals to be able to work. For this question I just thought a string with lots of functions in it would do. Perhaps for regex, but not for codetools or hrbrmstr's method. – lawyeR Jan 03 '15 at 14:38
  • The question had 'Int'l' (a single quote within two single quotes) which is a syntax error so it had to be changed. – G. Grothendieck Jan 03 '15 at 15:27