Multiline regular expressions in R

Question

I need to find instances of the LaTeX \index command that contain commas, across a whole bunch of knitr documents (.Rnw). These entries may span multiple lines, e.g.

\index{prior distribution,choosing beta prior for
$\pi$,vague prior knowledge}

I'm reasonably happy with my R code to find things:

line = paste(readLines(input), collapse = "\n")
r = gregexpr(pattern, line)

if(length(r) > 0){
    lapply(regmatches(line, r), function(e){cat(paste(substr(e, 0, 50), "\n"))})
}

However, I can't seem to get the regular expression right. I've tried

pattern = "(\\s)\\\\index\\{.*[,][^}]*\\}"

which gets some but not everything

pattern = "\\\\index\\{[A-Za-z \\s][^}]*\\}"

which gets more, but a lot I don't want. For example it finds

\index{posterior variance!beta distribution}

Any help appreciated.

It would help if you had a larger set of things to match or not in your example. Regardless, there's a multi-line flag `(?m)` you can set in perl-like regex. Something like `pattern = "(?m)^\\\\index\\{.*[,][^}]*\\}"`? You'll need to set `perl = TRUE` in `gregexpr`. — alistaire, Apr 07 '16 at 04:37
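
As a small illustration of the flags mentioned in this comment, the sketch below uses base R with `perl = TRUE`. The names `x`, `with_m`, and `with_s` are made up for the example. It suggests that `(?m)` only changes where `^` and `$` match, while `(?s)` is the flag that lets `.` cross a line break.

```r
# Sample text: one \index entry that spans two lines (illustrative only).
x <- "\\index{prior distribution,choosing beta prior for\n$\\pi$,vague prior knowledge}"

# (?m) only changes what ^ and $ match; "." still stops at the newline,
# so the closing brace on the second line is never reached.
with_m <- grepl("(?m)\\\\index\\{.*,.*\\}", x, perl = TRUE)

# (?s) lets "." match newlines, so the pattern can cross the line break.
with_s <- grepl("(?s)\\\\index\\{.*,.*\\}", x, perl = TRUE)

print(c(m = with_m, s = with_s))
```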

score 1 · Accepted Answer · edited May 23 '17 at 10:28

Often it is easier to use multiple regexes in a row than one regex that gets exactly what you want. In your case:

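library(stringr)
# Sample text: one multi-line \index entry with commas, one entry without
t = "\\index{prior distribution,choosing beta prior for
  \\$\\pi\\$,vague prior knowledge} bleh
\\index{posterior variance!beta distribution}"
cat(t)

# Tier 1: all \index{...} entries; (?s) lets . match newlines.
# Four backslashes in the R string give a literal backslash in the regex.
tier_1 = str_match_all(t, "(?s)\\\\index\\{.*?\\}")[[1]]
# Tier 2: keep only the entries that contain a comma
tier_2 = tier_1[str_detect(tier_1, ",")]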

The first regex finds every \index{} entry, across lines. The second step keeps only those that contain a comma.

On the sample text this keeps the first entry and drops the second. You can add more tiers like this to filter out other things you don't want.
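
For readers who prefer not to depend on stringr, here is a minimal base R sketch of the same two-tier idea. It is an illustration under stated assumptions, not the answer's own code: the names `txt`, `all_entries`, `with_comma`, and `one_pass` are invented here, and the pattern assumes that no \index entry contains nested braces.

```r
# Sample text mirroring the question: one multi-line entry with commas,
# one single-line entry without (illustrative only).
txt <- paste(
  "\\index{prior distribution,choosing beta prior for",
  "$\\pi$,vague prior knowledge} bleh",
  "\\index{posterior variance!beta distribution}",
  sep = "\n"
)

# Tier 1: every \index{...} entry. The class [^}] also matches newlines,
# so no (?s) flag is needed here.
all_entries <- regmatches(txt, gregexpr("\\\\index\\{[^}]*\\}", txt, perl = TRUE))[[1]]

# Tier 2: keep only the entries that contain a comma.
with_comma <- all_entries[grepl(",", all_entries, fixed = TRUE)]

# The same filter in one pass: require a comma before the closing brace.
one_pass <- regmatches(txt, gregexpr("\\\\index\\{[^},]*,[^}]*\\}", txt, perl = TRUE))[[1]]

cat(with_comma, sep = "\n")
```

Because `[^}]` stops at the first closing brace, an entry such as `\index{foo\textbf{bar},baz}` would be cut short; handling nested braces would need a different approach.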

edited May 23 '17 at 10:28 by Community
answered Apr 07 '16 at 04:30 by CoderGuy123

Thanks @Deleet. Using the second pattern `pattern = "\\\\index\\{[A-Za-z \\s][^}]*\\}"` did the trick – James Curran Apr 07 '16 at 04:44