I would like to be able to control the hierarchy of elements I extract from a search string.

Specifically, in the string "425 million won", I would like to extract "won" first, but then "n" if "won" doesn't appear.

I want the result to be "won" for the following call, which currently returns "n":

stringr::str_extract("425 million won", "won|n")

Note that specifying a space before won in my regex is inadequate because of other limitations in my data (there may not necessarily be a space between "million" and "won"). Ideally, I would like to do this using regex, as opposed to if-else clauses because of performance considerations.

matsuo_basho
  • "do this using regex [...] because of performance considerations": since when does regex mean good performance? Regexes are a handy tool, but rarely an efficient one. In this case I'd expect a single-regex solution to perform terribly, especially compared to a plain-text search. – Aaron Jan 26 '18 at 17:29

2 Answers


pattern <- "^(?:(?!won).)*\\K(?:won|n)"
s <- "425 million won"
m <- gregexpr(pattern,s,perl=TRUE)
regmatches(s,m)[[1]]

Explanation

  • ^ Assert position at the start of the line
  • (?:(?!won).)* Tempered greedy token: matches any character, zero or more times, as long as that character does not begin an occurrence of won
  • \K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
  • (?:won|n) Match either won or n
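
The explanation above can be checked against a few inputs. The following is a minimal base R sketch, not part of the original answer; `extract_unit` is a hypothetical helper name, and the sample strings are made up for illustration.

```r
# Hypothetical helper: apply the answer's pattern to a character vector.
extract_unit <- function(x) {
  pattern <- "^(?:(?!won).)*\\K(?:won|n)"
  m <- regexpr(pattern, x, perl = TRUE)   # first match per element, -1 if none
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matched elements only
  out
}

extract_unit(c("425 million won", "425 millionwon", "425 million", "425"))
# "won" "won" "n" NA
```

When won is absent, the greedy token first consumes the whole string and then backtracks, so the n that is returned is the last one in the string.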
ctwheels
  • @ctwheels, what is the procedure by which a string is extracted in my example? That is to say, since I specify won first in the pattern, why isn't "won" in the output? Just trying to see if I can keep the script simple. – matsuo_basho Jan 26 '18 at 18:52
  • @matsuo_basho it's because you're getting the `n` from `million`. It's returning the first match. This answer doesn't care for that `n` since it's going to look for `won` first and if it can't find it, it'll look for `n` – ctwheels Jan 26 '18 at 18:57
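
To make the point in the comment above concrete, here is a small base R sketch (an illustration, not from the original thread) showing where the plain alternation matches. The order of alternatives only matters among matches that start at the same position; the engine still reports the leftmost match in the string.

```r
s <- "425 million won"

# The plain alternation stops at the leftmost position where any branch matches.
m <- regexpr("won|n", s)
regmatches(s, m)                 # "n"
as.integer(m)                    # 11, the n in "million"

# "won" itself only starts later in the string.
as.integer(regexpr("won", s, fixed = TRUE))   # 13
```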

If you just want to extend on the code you already have:

 na.omit(stringr::str_extract("420 million won", c("won", "n")))[1]
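
The one-liner above handles a single string. If the input is a vector, the same "try each candidate in order" idea can be sketched in base R with plain-text (fixed) searches. This is only an assumption about how one might extend it; `extract_first` and its `candidates` argument are hypothetical names, not part of stringr or of the original answer.

```r
# Hypothetical helper: return the first candidate (in priority order)
# that occurs in each string, or NA if none does.
extract_first <- function(x, candidates = c("won", "n")) {
  out <- rep(NA_character_, length(x))
  for (cand in candidates) {
    todo <- is.na(out)                                  # still unresolved elements
    out[todo & grepl(cand, x, fixed = TRUE)] <- cand    # literal substring search
  }
  out
}

extract_first(c("420 million won", "420 million", "420"))
# "won" "n" NA
```

Because each candidate is searched as a literal substring, this only returns which candidate was found; it would need regexpr() instead of grepl() if the candidates were real patterns.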
Daniel