I need to find instances of the LaTeX \index command that contain commas, across a whole bunch of knitr documents (.Rnw). The entries may span multiple lines, e.g.

\index{prior distribution,choosing beta prior for
$\pi$,vague prior knowledge}

I'm reasonably happy with my R code to find things:

line = paste(readLines(input), collapse = "\n")
r = gregexpr(pattern, line)

if(length(r) > 0){
    lapply(regmatches(line, r), function(e){cat(paste(substr(e, 0, 50), "\n"))})
}

However, I can't seem to get the regular expression right. I've tried

pattern = "(\\s)\\\\index\\{.*[,][^}]*\\}"

which finds some of the entries but not all of them, and

pattern = "\\\\index\\{[A-Za-z \\s][^}]*\\}"

which finds more, but also a lot that I don't want. For example, it finds

\index{posterior variance!beta distribution}

Any help appreciated.

James Curran
  • It would help if you had a larger set of things to match or not in your example. Regardless, there's a multi-line flag `(?m)` you can set in perl-like regex. Something like `pattern = "(?m)^\\\\index\\{.*[,][^}]*\\}"`? You'll need to set `perl = TRUE` in `gregexpr`. – alistaire Apr 07 '16 at 04:37

1 Answer

Often it is easier to use multiple regexes in a row than a single regex that gets exactly what you want. In your case:

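library(stringr)

# Sample text: one multi-line \index entry with commas, one entry without.
t = "\\index{prior distribution,choosing beta prior for
  $\\pi$,vague prior knowledge} bleh
\\index{posterior variance!beta distribution}"
cat(t)

# Tier 1: every \index{...} entry; (?s) lets . match newlines.
# Four backslashes in the R string give one literal backslash in the regex.
tier_1 = str_match_all(t, "(?s)\\\\index\\{.*?\\}")[[1]]
# Tier 2: keep only the entries that contain a comma.
tier_2 = tier_1[str_detect(tier_1, ",")]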

The first regex finds every \index{...} entry, even when it spans several lines. The second step keeps only those that contain a comma.

This keeps the first entry and drops the second. You can add more tiers like this to filter out anything else you don't want.
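For anyone who would rather stay in base R, as in the question's own code, the same two-tier idea can be sketched roughly as follows. The helper name `find_comma_index_entries` and the sample text are made up for illustration, and the pattern assumes that index entries never contain nested braces (something like `\index{foo@\textbf{foo}}` would be cut short).

```r
# Hypothetical helper (the name is illustrative, not from the original post):
# returns every \index{...} entry in `text` that contains a comma.
find_comma_index_entries <- function(text) {
  # Tier 1: grab every \index{...}. The negated class [^}] also matches
  # newlines, so entries spanning several lines are captured.
  # Assumption: no nested braces inside an entry.
  m <- gregexpr("\\\\index\\{[^}]*\\}", text, perl = TRUE)
  entries <- regmatches(text, m)[[1]]
  # Tier 2: keep only the entries containing a literal comma.
  entries[grepl(",", entries, fixed = TRUE)]
}

# Sample text mirroring the two entries from the question.
sample_text <- paste(
  "\\index{prior distribution,choosing beta prior for",
  "$\\pi$,vague prior knowledge} bleh",
  "\\index{posterior variance!beta distribution}",
  sep = "\n"
)

hits <- find_comma_index_entries(sample_text)
cat(hits, sep = "\n")
```

To run it over a real file, the text could be built the same way as in the question, with `paste(readLines(input), collapse = "\n")`, and passed to the helper.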

CoderGuy123