rm_between with multiple markers in an observation

Question

There are some helpful answers on here about using rm_between when each observation has only one instance of the markers. However, I have a dataset where I want to extract the text inside double quotes, and some of the observations contain more than one quoted piece. For example:

Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"

When I use this code,

library(qdapRegex)
rf <- data.frame(rm_between_multiple(H2$SE_DESC_EN, c("\"", "\""), c("\"", "\"")))

it creates a data frame, and for the example line above

 "Fresh or chilled Atlantic salmon and Danube salmon"

is returned, which is perfect. However, I need the missing data. To try and retain it, I changed my code slightly to:

H3 <- rm_between_multiple(H2$SE_DESC_EN, c("\"", "\""), c("\"", "\""), extract=TRUE)

to create a list with the data in the quotations. For that same line it returns:

c("Salmo salar", " and Danube salmon ", "Hucho hucho", 
  "Salmo salar", " and Danube salmon ", "Hucho hucho")

This has the data in quotations, but it also has some text from between the quoted pieces, and everything is repeated. I'm fairly new at programming and was wondering if there is a way to write code that will not include the text between these quotations.

library(qdapRegex) is the package I'm using. – Tori Shannon, Jun 29 '15 at 15:58

Tyler Rinker · Answer 1 · 2015-06-29T20:37:05.453

I think you don't need rm_between_multiple, just rm_between. There also appears to be a regex issue when the same left and right marker is used; I'm not sure yet whether this is a bug. For now you can use the following to extract:

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

rm_default(
    x, 
    pattern = S("@rm_between", '"'),
    extract=TRUE
)

## [[1]]
## [1] "\"Salmo salar\"" "\"Hucho hucho\""

Edit: I think this is because the default regex of rm_between does not include the left/right bounds. It uses the following regex: "(?<=\").*?(?=\")". Because the lookbehind and lookahead do not consume the left/right bounds, the closing quotation mark of one pair remains available as the opening mark of the next match, which yields " and Danube salmon ". This is (IMO) a bug that I will address, but I am unsure how yet.
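The overlap can be reproduced in base R without qdapRegex. This is only a sketch of the behavior described above, not the package's internal code; `lookaround` and `consuming` are illustrative names, and the second pattern is just one possible way to consume the quotation marks:

```r
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

# Lookarounds do not consume the quotes, so the closing quote of one
# pair is reused as the opening quote of the next match.
lookaround <- regmatches(x, gregexpr('(?<=").*?(?=")', x, perl = TRUE))[[1]]
lookaround
## [1] "Salmo salar"         " and Danube salmon " "Hucho hucho"

# Consuming both quotes pairs them up correctly; strip them afterwards.
consuming <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]
gsub('"', '', consuming, fixed = TRUE)
## [1] "Salmo salar" "Hucho hucho"
```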

Edit 2: I incorporated @hwnd's response into rm_between in the dev version of qdapRegex. You can install the dev version via:

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/qdapRegex"); p_load(qdapRegex)

and ...

rm_between(x, '"', '"', extract = TRUE)

## [[1]]
## [1] "Salmo salar" "Hucho hucho"
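To handle a whole column such as H2$SE_DESC_EN, the extracted names can be kept next to the cleaned text. The sketch below is an assumption-laden illustration: `desc` is a small made-up vector standing in for the real column, and `extract_quoted` is a hypothetical base R helper (not part of qdapRegex) so that the example runs without the dev version installed:

```r
# Made-up stand-in for H2$SE_DESC_EN
desc <- c(
  'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"',
  'Frozen cod "Gadus morhua"',
  'Other fish'
)

# Hypothetical helper: return the quoted pieces of each element, quotes removed
extract_quoted <- function(x) {
  m <- regmatches(x, gregexpr('"[^"]*"', x))
  lapply(m, function(s) gsub('"', '', s, fixed = TRUE))
}

species <- extract_quoted(desc)

# Drop the quoted pieces, then collapse the leftover whitespace
cleaned <- gsub('"[^"]*"', '', desc)
cleaned <- trimws(gsub('\\s+', ' ', cleaned))

# One row per observation; `species` is stored as a list column
out <- data.frame(description = cleaned, stringsAsFactors = FALSE)
out$species <- species
```

Observations without any quoted text get an empty character vector in `species`, so the rows stay aligned with the original data.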
