
Say I have the following string:

params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

I want to split it using , as a separator, then extract the "quoted substrings", so that I get two vectors as follows:

params_clean <- c("var1","var2","var3")
params_def   <- c("first, variable","","third, variable") # note the empty string as a second element.

I use the term "quoted" in a wide sense, with arbitrary strings, here /* and */, which protect substrings from being split.

I found a workaround based on read.table and the fact that it allows quoted elements:

library(magrittr)
params %>%
  gsub("/\\*","_temp_sep_ '",.) %>%
  gsub("\\*/","'",.) %>%
  read.table(text=.,stringsAsFactors=FALSE,sep=",") %>%
  unlist %>%
  unname %>%
  strsplit("_temp_sep_") %>%
  lapply(trimws) %>%
  lapply(`length<-`,2) %>%
  do.call(rbind,.) %>%
  inset(is.na(.),value="")

But it's quite ugly and hackish. What's a simpler way? I suspect there is a regex I could feed to strsplit for this situation.


moodymudskipper

3 Answers


You may use

library(stringr)
cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"
res <- str_match_all(params, cmnt_rx)
params_clean <- res[[1]][,2]
params_clean
## => [1] "var1" "var2" "var3"
params_def <- gsub("^/[*]\\s*|\\s*[*]/$", "", res[[1]][,3])
params_def[is.na(params_def)] <- ""
params_def
## => [1] "first, variable" ""                "third, variable"

The main regex details (the overall shape is (\w+)\s*(COMMENTS_REGEX)?):

  • (\w+) - Capturing group 1: one or more word chars
  • \s* - 0+ whitespace chars
  • ( - Capturing group 2 start
  • /\* - match the comment start /*
  • [^*]*\*+ - match 0+ characters other than * followed by 1+ literal *
  • (?:[^/*][^*]*\*+)* - 0+ sequences of:
    • [^/*][^*]*\*+ - a character that is not / or * (matched with [^/*]) followed by 0+ non-asterisk characters ([^*]*) followed by 1+ asterisks (\*+)
  • / - closing /
  • )? - Capturing group 2 end, repeat 1 or 0 times (it means it is optional).

See the regex demo.

The "^/[*]\\s*|\\s*[*]/$" pattern in gsub removes /* and */ with adjoining spaces.

The params_def[is.na(params_def)] <- "" line replaces NA with an empty string. NA appears where the optional group 2 did not match, as for var2.
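
For readers who prefer to avoid the stringr dependency, the same pattern can be applied with base R. This is only a sketch under the assumption that every parameter name consists of word characters; the name tokens is illustrative and not part of the answer above.

```r
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"

# One full match per parameter, including its optional trailing comment
tokens <- regmatches(params, gregexpr(cmnt_rx, params, perl = TRUE))[[1]]

# The name is the leading run of word characters in each token
params_clean <- sub("^(\\w+).*$", "\\1", tokens, perl = TRUE)

# The definition is the text between /* and */, or "" when there is no comment
params_def <- ifelse(
  grepl("/*", tokens, fixed = TRUE),
  trimws(sub("^\\w+\\s*/\\*(.*)\\*/$", "\\1", tokens, perl = TRUE)),
  ""
)

params_clean
## => [1] "var1" "var2" "var3"
params_def
## => [1] "first, variable" ""                "third, variable"
```

Because the optional comment group is greedy, each comment is consumed together with the name that precedes it, so words inside a comment are never picked up as parameter names.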

Wiktor Stribiżew

Here you are

library(stringr)
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
# Split by , which are not enclosed in your /*...*/ 
params_split <- str_split(params, ",(?=[^(/[*])]*(/[*]))")[[1]]
# Extract matches of /*...*/, only taking the contents
params_def <- str_match(params_split, "/[*] *(.*?) *[*]/")[,2]
params_def[is.na(params_def)] <- ""
# Remove traces of /*...*/
params_clean <- trimws(gsub("/[*] *(.*?) *[*]/", "", params_split))
whalea

You can wrap it in a function and use the (not well documented) (*SKIP)(*FAIL) mechanism in plain R:

getparams <- function(params) {
  tmp <- unlist(strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE))

  # pre-allocate character vectors for the results
  params_clean <- character(length(tmp))
  params_def <- character(length(tmp))

  for (i in seq_along(tmp)) {
    # get params_def if available
    match <- regmatches(tmp[i], regexec("/\\*(.*?)\\*/", tmp[i]))
    params_def[i] <- ifelse(identical(match[[1]], character(0)), "", trimws(match[[1]][2]))

    # params_clean
    params_clean[i] <- trimws(gsub("/\\*.*?\\*/", "", tmp[i]))
  }

  return(list(params_clean = params_clean, params_def = params_def))
}

params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
getparams(params)

This splits the initial string using (*SKIP)(*FAIL) (see a demo on regex101.com) and analyzes the parts afterwards.


This yields a list:
$params_clean
[1] "var1" "var2" "var3"

$params_def
[1] "first, variable" ""                "third, variable"
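
To see what the (*SKIP)(*FAIL) split does on its own, before any post-processing, it can help to look at the intermediate pieces. The following is a small sketch; the names parts and parts2 are illustrative only.

```r
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

# Left alternative: match a whole /* ... */ block, then discard it and
# skip past it, so commas inside the block can never be split points.
# Right alternative: any remaining comma is a real separator.
split_rx <- "/\\*.*?\\*/(*SKIP)(*FAIL)|,"

parts <- trimws(strsplit(params, split_rx, perl = TRUE)[[1]])
parts
## => [1] "var1 /* first, variable */" "var2" "var3 /* third, variable */"

# The last parameter does not need a comment for the split to work
parts2 <- trimws(strsplit("a /* x, y */, b", split_rx, perl = TRUE)[[1]])
parts2
## => [1] "a /* x, y */" "b"
```

Note that perl = TRUE is required, since these backtracking control verbs are a PCRE feature and are not understood by R's default regex engine.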


Or, shorter with sapply:
getparams <- function(params) {
  tmp <- unlist(strsplit(params, "/\\*.*?\\*/(*SKIP)(*FAIL)|,", perl = TRUE))
  # one column per parameter: row 1 is the name, row 2 the definition
  sapply(tmp, function(x) {
    match <- regmatches(x, regexec("/\\*(.*?)\\*/", x))
    def <- ifelse(identical(match[[1]], character(0)), "", trimws(match[[1]][2]))
    clean <- trimws(gsub("/\\*.*?\\*/", "", x))
    c(clean, def)
  }, USE.NAMES = FALSE)
}

Which will yield a matrix:

     [,1]              [,2]   [,3]             
[1,] "var1"            "var2" "var3"           
[2,] "first, variable" ""     "third, variable"

With the latter, if you store the output as result <- getparams(params), you get the variable names with result[1,] and the definitions with result[2,].

Jan