4

I am trying to extract strings between words. Consider this example -

x <-  "There are 2.3 million species in the world"

This may also take another form which is

x <-  "There are 2.3 billion species in the world"

I need the text between 'There' and either 'million' or 'billion', including those words. Whether the sentence contains million or billion is only known at run time; it is not known beforehand. So the output I need from this sentence is

[1] There are 2.3 million OR
[2] There are 2.3 billion

I am using the rm_between function from the qdapRegex package for this. With this command I can extract only one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE) 

OR I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a single command that checks for either million or billion in the same sentence? Something like this -

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.

Tyler Rinker
Ronak Shah

4 Answers

3

You may use str_extract_all (for global matching) or str_extract (single match)

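library(stringr)
# x is the string from the question
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")

or

# regex() replaces the perl() wrapper, which is deprecated in stringr >= 1.0
str_extract_all(x, regex("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))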
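
For readers without stringr, the same alternation pattern can also be applied with base R. This is only a minimal sketch, assuming x holds the sentences from the question (both variants are included here to show that one pattern covers either case). Note that regmatches() combined with regexpr() silently drops elements that do not match.

```r
# Sentences from the question (both variants, for illustration)
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazily match from "There" up to the first "million" or "billion"
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match per element; regmatches() extracts it
result <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(result)
```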
Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
3

The left and right arguments in rm_between take a vector of character/numeric symbols, so you can pass vectors of equal length to both the left and right arguments.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

Or

  sub('\\s*species.*', '', x)

data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"
akrun
  • Thanks for the answer, @akrun. However, the data you have provided is not accurate. At any given time, only one of the sentences can be true: it is either "There are 2.3 million species in the world" or "There are 2.3 billion species in the world". – Ronak Shah Jul 25 '15 at 05:01
  • @RonakShah Please check my updates. It works for individual cases also. – akrun Jul 25 '15 at 05:06
2

With rm_between you can supply vectors of multiple markers, of equal length, as the documentation states.

EDIT

See @TylerRinker's answer for the updated arguments for rm_between.

Alternatively, rm_default lets you supply a user-defined regex:

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example:

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"
hwnd
2

@hwnd's (my fellow qdapRegex co-author) response inspired a discussion that has led to a new argument, fixed, for rm_between. The following description is in the dev version:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed by default (escaped). This did not allow for the powerful use of a regular expression for left/right boundaries. The fixed = TRUE behavior is still the default but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using logical operators in rm_between function
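
To illustrate the difference between literal (fixed) and regular-expression boundaries, here is a rough base R sketch. The helper extract_between is hypothetical and is not how qdapRegex implements the feature; it only mimics the idea by quoting the boundaries with PCRE's \Q...\E when fixed = TRUE.

```r
# Hypothetical helper (not part of qdapRegex): extract the text from `left`
# through `right`, optionally treating both boundaries as literal strings.
extract_between <- function(text, left, right, fixed = TRUE) {
  if (fixed) {
    # \Q...\E makes PCRE treat everything in between as literal text
    left  <- paste0("\\Q", left,  "\\E")
    right <- paste0("\\Q", right, "\\E")
  }
  pattern <- paste0(left, ".*?", right)
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))   # NA where nothing matched
  out[m != -1] <- regmatches(text, m)
  out
}

x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Literal boundary: looks for the text "[mb]illion" itself, so nothing matches
literal <- extract_between(x, "There", "[mb]illion", fixed = TRUE)

# Regex boundary: "[mb]illion" matches either "million" or "billion"
as_regex <- extract_between(x, "There", "[mb]illion", fixed = FALSE)
```

With literal boundaries the character class is never interpreted, which is why the question's use case needs the regular-expression mode.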

To install the dev version:

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

Using qdapRegex version >= 4.1 you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
    include.markers = TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
Tyler Rinker