
I have a file of macros exported from SAS that I want to parse in order to build documentation with R and markdown (I can't use existing external software due to security limitations at work).

Specifically I want to extract:

  • the name of the macro
  • parameters and their description
  • contents of two sections named USES and EXAMPLES
  • the body of the macro function

Unfortunately, my lack of regex skills is hurting me again, though I don't think the rules are that complicated.

See my example below and expected output:

my_text <- "
%macro macro_name_1
/*----------------------------------------------------------------------------------
optional macro description on one or several lines.
this section always starts with slash star dashes and ends with dashes star slash
and it never contains these combinations of characters in the text                                                         
----------------------------------------------------------------------------------*/
(param1 /* optional description of param1 */
,param2
,param3 /* optional description of param3 */
);
/* USES: 
some info on one or several lines,
always starts with 'slash star USES:'
and ends with 'star slash'
but doesn't contain these combinations of characters
*/
/* EXAMPLES:
some examples on one or several lines,
always starts with 'slash star EXAMPLES:' OR 'slash star EXAMPLE:'
and ends with 'star slash'
but doesn't contain these combinations of characters
*/
some code on one or several lines,
always after USES and EXAMPLE(S) sections
that may or not contain combinations of /* and */
%mend;

some text outside of a macro-mend pattern, which I wish to ignore

%macro macro_name_2
/*---------------------
desc of macro_name_2                                                     
---------------------*/
(x
,y /* desc of y*/
);
/* USES: something */
/* EXAMPLE:
example for macro_name2
*/
code2
%mend;

some more irrelevant text

%macro macro_name_3;
code3
%mend;
"

The output doesn't have to be identical to what I propose here, but it should have at least a similar structure (text is abbreviated for readability):

expected_output <- tibble::tribble(
  ~'macro_name',          ~'description',                   ~'parameters',        ~'uses',        ~'examples',        ~'code',
  "macro_name_1",    "optional macro...",  list(param1="optional desc...",
                                               param2="",
                                               param3="optional desc..."), "Some info...", "some examples...", "some code...",
  "macro_name_2", "desc of macro_name_2",       list(x="", y="desc of y"),    "something",   "example for...",        "code2",
  "macro_name_3",                     "",                          list(),             "",                 "",        "code3")


# # A tibble: 3 x 6
#     macro_name          description parameters         uses         examples         code
#          <chr>                <chr>     <list>        <chr>            <chr>        <chr>
# 1 macro_name_1    optional macro... <list [3]> Some info... some examples... some code...
# 2 macro_name_2 desc of macro_name_2 <list [2]>    something   example for...        code2
# 3 macro_name_3                      <list [0]>                                      code3
moodymudskipper
  • Too broad? I gave a very specific expected output and example data stripped to the essentials, come on. This can be useful as is to many SAS users in the same situation as I am, and it is far more useful and less confusing than splitting this into several regex questions. – moodymudskipper Apr 10 '18 at 10:55

2 Answers

I believe this will get you very close to what you want. I haven't taken the time to put it into a tibble, but I suspect you can figure out how to arrange those components to your own preference.

It relies heavily on this answer. It only uses one function from stringr: str_split_fixed is really convenient for getting the parameters and their descriptions organized.
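
To show the parameter step in isolation, here is a minimal sketch of what "split on the first space" does to one parameter chunk. It uses base R as a stand-in for str_split_fixed(x, " ", n = 2) so it runs without stringr; the vector `x` is a hypothetical input, shaped like the pieces you get after splitting the parameter list on commas.

```r
# Hypothetical parameter chunks, as they look after splitting on ","
x <- c("param1 /* optional description of param1 */", "param2")

# Base-R stand-in for stringr::str_split_fixed(x, " ", n = 2):
# everything before the first space is the name, the rest is the description
name <- sub(" .*$", "", x)
desc <- ifelse(grepl(" ", x, fixed = TRUE), sub("^[^ ]* ", "", x), "")

# Strip the comment delimiters and surrounding whitespace
desc <- trimws(gsub("(/[*]|[*]/)", "", desc))

data.frame(name = name, desc = desc, stringsAsFactors = FALSE)
```

A parameter without a comment simply ends up with an empty description.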

library(stringr)

# Make one character string per macro
macro <- gsub("\\n", "   ", my_text)
macro <- regmatches(macro, gregexpr("(?=%macro).*?(?<=%mend)", macro, perl=TRUE))[[1]]


# MACRO NAME --------------------------------------------------------
macro_name <- 
  unlist(
    regmatches(macro, gregexpr("(?<=%macro ).*?(?= )", macro, perl = TRUE))
  )
macro_name <- 
  sub(";", "", macro_name)

# MACRO DESCRIPTION -------------------------------------------------

macro_desc <- 
  regmatches(macro, gregexpr("(?=/[*]).*?(?<=[*]/)", macro, perl = TRUE))

macro_desc <- vapply(macro_desc,
                     function(x) if (length(x)) x[1] else "",
                     character(1))
macro_desc <- gsub("(/[*][-]+|[-]+[*]/)", "", macro_desc)
macro_desc <- trimws(macro_desc)

# MACRO PARAMETERS --------------------------------------------------

param <- regmatches(macro, gregexpr("(?=\\().*?(?<=\\);)", macro, perl = TRUE))
param <- 
  vapply(param,
         function(x) if (length(x)) gsub("(\\(|\\)|;)", "", x) else "",
         character(1))
param <- strsplit(param, ",")

param <- 
  lapply(param,
         str_split_fixed,
         " ",
         n = 2)

param <- 
  lapply(param,
         function(x) trimws(gsub("(/[*]|[*]/)", "", x)))

# Clean out everything up to the end of the parameters
# This might be problematic if the combination of ');' appears
# before the end of the parameters definition

macro <- trimws(sub("^\\%macro.*?;", "", macro))

# MACRO USES --------------------------------------------------------

# Just in case there are multiple spaces between /* and USES, coerce it to 
# only one space.
macro <- sub("/[*] +USES", "/* USES", macro)
uses <- regmatches(macro, gregexpr("(?<=/[*] USES[:]).*?(?=[*]/)", macro, perl = TRUE))
uses <- vapply(uses,
               function(x) if (length(x)) trimws(x) else "",
               character(1))

# Clean out everything up to the end of the USES

macro <- trimws(sub(".+(?<=/[*] USES[:]).*?(?<=[*]/)", "", macro, perl = TRUE))

# MACRO EXAMPLES ----------------------------------------------------

# Just in case there are multiple spaces between /* and EXAMPLES, coerce it to 
# only one space.
macro <- sub("/[*] +EXAMPLE", "/* EXAMPLE", macro)
examples <- regmatches(macro, gregexpr("(?<=/[*] EXAMPLE).*?(?=[*]/)", macro, perl = TRUE))
examples <- vapply(examples,
                   function(x) if (length(x)) trimws(sub("^(S[:]|[:]) +", "", x)) else "",
                   character(1))

# Clean out everything up to the end of the EXAMPLES

macro <- trimws(sub(".+(?<=/[*] EXAMPLE).*?(?<=[*]/)", "", macro, perl = TRUE))

# MACRO BODY --------------------------------------------------------

# At this point, the body should be the only thing left in `macro`, except 
# for the `%mend` call

body <- trimws(sub("%mend(|.+)$", "", macro))

# RESULTS

macro_name # character vector
macro_desc # character vector
param      # list of two column matrices
uses       # character vector
examples   # character vector
body       # character vector
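
If you want everything in one table, the pieces above could be assembled roughly like this. This is only a sketch: the values are hypothetical stand-ins for two of the macros, the parameters are stored as named character vectors rather than the two-column matrices produced above, and it uses a base data.frame with a list column instead of a tibble.

```r
# Hypothetical stand-ins for the vectors built above (two macros only)
macro_name <- c("macro_name_2", "macro_name_3")
macro_desc <- c("desc of macro_name_2", "")
param      <- list(c(x = "", y = "desc of y"), character(0))
uses       <- c("something", "")
examples   <- c("example for macro_name2", "")
body       <- c("code2", "code3")

# One row per macro; character columns first
out <- data.frame(macro_name, description = macro_desc, uses, examples,
                  code = body, stringsAsFactors = FALSE)

# Add the parameters as a list column (one list element per row)
out$parameters <- param

str(out)
```
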
Benjamin
  • There's an issue when you parse the end of the parameters section. I think you want to erase until the first semicolon, but you actually erase until the last one: `sub("^\\%macro.+(\\);|;)", "", "%macro hello(a,b);this works")` , `sub("^\\%macro.+(\\);|;)", "", "%macro hello(a,b);this is erased; this stays")` – moodymudskipper Apr 10 '18 at 14:35
  • Try the edit (`macro <- trimws(sub("^\\%macro.*?;", "", macro))`). No guarantee it will work. As you add more macros, you're bound to find more irregularities that this doesn't handle. – Benjamin Apr 10 '18 at 14:42
  • Thanks, I used something similar, yes there were several glitches but I have something to build on, I'll post a function as an answer once I have something I'm happy with. – moodymudskipper Apr 10 '18 at 16:00
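
The greedy-versus-lazy difference discussed in the comments above can be reproduced in isolation. This is a small sketch using the two patterns quoted there; `perl = TRUE` is my addition so that the quantifier behavior is PCRE's, and the input string is the one from the comment.

```r
x <- "%macro hello(a,b);this is erased; this stays"

# Greedy `.+` runs to the LAST ";" so part of the body is lost
greedy <- sub("^\\%macro.+(\\);|;)", "", x, perl = TRUE)

# Lazy `.*?` stops at the FIRST ";" so the whole body survives
lazy <- sub("^\\%macro.*?;", "", x, perl = TRUE)

greedy
lazy
```

With the lazy version the body comes back as "this is erased; this stays", while the greedy one keeps only " this stays".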

Here is a solution inspired by @Benjamin (his solution showed a few problems with the real data, so I reworked it):

sas2tibble <- function(txt){

# FULL MACROS
# macros all start by '%macro' and finish by '%mend;'
# These keywords are never used elsewhere
pattern     <- '%macro (.+?)%mend;'
full_macros <- regmatches(txt,gregexpr(pattern,txt))[[1]] # match pattern

# NAMES
# names are always after %macro and stop before either:
# - description       : starting with '/'
# - parameters        : starting with '('
# - end of statement  : ';'
pattern <- '%macro (.+?)[/\\(;]'
names   <- regmatches(full_macros,gregexpr(pattern,full_macros))  # match pattern
names   <- trimws(sub(pattern,"\\1",names))                       # extract (.+?) from matches and trim

# DESCRIPTIONS
# They are always between %macro and the first ';'
# They are between /*-- and --*/
pattern      <- '%macro (.+?)/\\*(-+)(.+?)(-+)\\*/(.+?);'
descriptions <- regmatches(full_macros,gregexpr(pattern,full_macros)) # match pattern
descriptions[lengths(descriptions)==0] <- ""                          # convert character(0) to ""
descriptions <- trimws(sub(pattern,"\\3",descriptions))               # extract (.+?) from matches and trim

# PARAMETERS
# They are always between %macro and the first ';'
# They are after description if it exists
# They are between '(' and ')'
pattern      <- '%macro (.+?)(/\\*(-+)(.+?)(-+)\\*/)*[\n ]*(\\((.+?)\\))(.+?);'
params <- regmatches(full_macros,gregexpr(pattern,full_macros)) # match pattern
params[lengths(params)==0] <- ""                          # convert character(0) to ""
params <- trimws(sub(pattern,"\\7",params))

pattern <- '/\\*(.+?)\\*/'
param_defs <- sapply(strsplit(params,","),function(x) {
  out <- regmatches(x,gregexpr(pattern,x)) # match pattern
  out[lengths(out)==0] <- ""                          # convert character(0) to ""
  out <- trimws(sub(pattern,"\\1",out))
})
params <- sapply(strsplit(params,","),function(x) trimws(sub(pattern,"",x)))

# USES
# always between '/* USES:' and next `*/`
pattern      <- '/\\* USES:(.+?)\\*/'
uses <- regmatches(full_macros,gregexpr(pattern,full_macros)) # match pattern
uses[lengths(uses)==0] <- ""                            # convert character(0) to ""
uses <- trimws(sub(pattern,"\\1",uses))

# EXAMPLES
# always between '/* EXAMPLE(S):' and next `*/`
pattern  <- '/\\* EXAMPLES*:(.+?)\\*/'
examples <- regmatches(full_macros,gregexpr(pattern,full_macros)) # match pattern
examples[lengths(examples)==0] <- ""                          # convert character(0) to ""
examples <- trimws(sub(pattern,"\\1",examples))

# BODY
# after first ';' but not uses or examples
pattern <- '/\\* USES:(.+?)\\*/'
body <- sub(pattern,"",full_macros)
pattern <- '/\\* EXAMPLES*:(.+?)\\*/'
body <- sub(pattern,"",body)
pattern <- ";(.*?)%mend;"
body    <- regmatches(body,gregexpr(pattern,body)) # match pattern
body    <- trimws(sub(pattern,"\\1",body))               # extract (.+?) from matches and trim
tibble::tibble(full_code = full_macros,
       name = names,
       desc = descriptions,
       params = params,
       param_defs = param_defs,
       uses = uses,
       examples = examples,
       body = body)
}

View(sas2tibble(my_text))

I think the cleanest solution would be to find a single regular expression that matches the general structure, and then extract the relevant info using \\1, \\2, etc. Unfortunately I wasn't able to produce it, but the function above works fine.
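
For what it's worth, a single-pattern approach might look something like the sketch below. It is an assumption-laden illustration, not a tested replacement for the function above: the pattern is my own guess at the structure described in the question, it needs `perl = TRUE` (for `(?s)` and the lazy quantifiers), and it is only tried here on one macro where every section is present. The names `one_macro`, `pattern` and `parts` are made up for the example.

```r
# One macro copied from the example data, with every section present
one_macro <- "%macro macro_name_2
/*---------------------
desc of macro_name_2
---------------------*/
(x
,y /* desc of y*/
);
/* USES: something */
/* EXAMPLE:
example for macro_name2
*/
code2
%mend;"

# (?s) lets "." match newlines; each capture group is one field
pattern <- paste0(
  "(?s)%macro\\s+(\\w+)\\s*",            # 1: macro name
  "(?:/\\*-+(.*?)-+\\*/)?\\s*",          # 2: optional description
  "(?:\\((.*?)\\))?\\s*;\\s*",           # 3: optional parameter list
  "(?:/\\*\\s*USES:(.*?)\\*/)?\\s*",     # 4: optional USES section
  "(?:/\\*\\s*EXAMPLES?:(.*?)\\*/)?",    # 5: optional EXAMPLE(S) section
  "(.*?)%mend;"                          # 6: body
)

# regexec() returns the full match followed by the capture groups
m <- regmatches(one_macro, regexec(pattern, one_macro, perl = TRUE))[[1]]

# Drop the full match, trim whitespace, and label the fields
parts <- trimws(m[-1])
names(parts) <- c("name", "description", "params", "uses", "examples", "body")
parts
```

The parameter list still comes back as one string, so it would need the same comma split and comment extraction as in the function above.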

moodymudskipper
  • Your solution is both more robust and better explained. I won't be offended if you mark it as the correct answer (in fact, I think you should). – Benjamin Apr 11 '18 at 11:52
  • Thanks, I think I will once I have something I'm happy with, this still isn't good enough. – moodymudskipper Apr 11 '18 at 23:25