I'm working with long strings in R such as:

string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"

I would like to extract the substring between the first occurrence of 'LESSONS' and the last occurrence of 'Figure 1', as follows:

"The previous LESSONS are very important as seen in Figure 1. This text is also important."

I tried the following, but it returns the substring after the last occurrence of 'LESSONS', not the first:

gsub(".*LESSONS (.*) Figure 1.*", "\\1", string)
#[1] "are very important as seen in Figure 1. This text is also important."

I also tried the following, but it cuts the string at the first occurrence of 'Figure 1', not the last:

library(qdapRegex)
ex_between(string, "LESSONS", "Figure 1")
#[[1]]
#[1] ". The previous LESSONS are very important as seen in"

I'd appreciate any help!

2 Answers

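You were very close. Make the quantifier before "LESSONS" non-greedy (.*? instead of .*) so that it stops at the first occurrence, and match the period and whitespace that follow it.

Also, sub is sufficient here instead of gsub, because the pattern matches the whole string and only one replacement is needed.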

sub(".*?LESSONS\\.\\s*(.*) Figure 1.*", "\\1", string)
#[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
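
To make the difference visible, here is a minimal sketch that runs the greedy pattern from the question and the lazy pattern side by side. It assumes `perl = TRUE` (an addition here, not used in the answer above) so that the lazy quantifier follows PCRE semantics; the variable names `greedy` and `lazy` are only illustrative.

```r
string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"

# Greedy leading .* consumes as much as possible, so the match
# settles on the LAST "LESSONS " (the one followed by a space).
greedy <- sub(".*LESSONS (.*) Figure 1.*", "\\1", string, perl = TRUE)

# Lazy leading .*? consumes as little as possible, so the match
# settles on the FIRST "LESSONS" (the one followed by a period).
lazy <- sub(".*?LESSONS\\.\\s*(.*) Figure 1.*", "\\1", string, perl = TRUE)

print(greedy)
#[1] "are very important as seen in Figure 1. This text is also important."
print(lazy)
#[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
```

In both patterns the capture group `(.*)` stays greedy, which is what pushes the right-hand boundary out to the last " Figure 1".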
Ronak Shah

You can use str_extract from the stringr package together with a positive lookbehind, (?<=...), and a positive lookahead, (?=...), to define the parts of the string that delimit the text you want to extract without including them in the match:

library(stringr)
str_extract(string, "(?<=LESSONS\\.\\s).*(?=\\sFigure 1)")
#[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
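
If adding a package is not an option, the same lookaround pattern should also work in base R. The following is a sketch under the assumption that `perl = TRUE` is used, since lookbehind and lookahead need the PCRE engine; the names `m` and `extracted` are illustrative only.

```r
string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"

# regexpr() locates the first match; perl = TRUE enables lookarounds.
# The lookbehind is fixed-length ("LESSONS." plus one whitespace character),
# which PCRE requires.
m <- regexpr("(?<=LESSONS\\.\\s).*(?=\\sFigure 1)", string, perl = TRUE)

# regmatches() pulls out the matched text at that position.
extracted <- regmatches(string, m)

print(extracted)
#[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
```

The greedy `.*` runs to the end of the string and then backtracks until the lookahead succeeds, which is why the match ends at the last " Figure 1" rather than the first.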
Chris Ruehlemann