I have a character vector of variable names and want to extract the currency from each one, where the possible currencies are given by a second vector. However, I am having difficulty extracting the values.

My first approach was to replace everything except the currency abbreviations with nothing.

For example:

x <- c("Total Assets in th USD", "Equity in mil EUR", "Number of Branches")
currencies <- c("USD", "EUR", "GBP")

regex <- paste0("([^",
                paste(currencies, collapse = "|"),
                "])")
# results in
# "([^USD|EUR|GBP])"

gsub(regex, "", x)
# [1] "USD"  "EEUR" "B" 

The expected result would be c("USD", "EUR", "")

This is obviously wrong: the square brackets form a negated character class, so the pattern works on the individual characters (E, U, R) instead of the character group (EUR). Now my question is: how can I extract only the given groups?

David
  • you can get a simple list of what matches what by just `sapply(currencies, function(y){ grep(pattern = y, x,value = F) })` – R.S. Dec 13 '16 at 16:41
  • If I search for `[r] regex currency` I find only one post that is related, but it does not solve my issue (https://stackoverflow.com/questions/14159690/regex-grep-strings-containing-us-currency). Can you please elaborate? – David Dec 13 '16 at 18:04
  • Darn. I got your post conflated with another one. If you just do any sort of edit (such as removing that unnecessary "thankyou" that is deprecated on SO) then my downvote can be reversed. – IRTFM Dec 13 '16 at 18:12
  • No worries, that is why I asked! :) – David Dec 13 '16 at 18:13

2 Answers

You may use

x <- c("Total Assets in th USD", "Equity in mil EUR", "Number of Branches")
currencies <- c("USD", "EUR", "GBP")

regex <- paste0("\\b(",
                paste(currencies, collapse = "|"),
                ")\\b")
# results in
# "\b(USD|EUR|GBP)\b"

regmatches(x, gregexpr(regex, x))

Output:

[[1]]
[1] "USD"

[[2]]
[1] "EUR"

[[3]]
character(0)

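If the currencies appear "glued" to numbers (for example "100USD"), you need to remove the word boundaries (\b): digits count as word characters, so there is no boundary between the digit and the currency code.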
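
To illustrate the word-boundary remark, here is a minimal sketch. The input vector `y` and the lookaround variant at the end are illustrative assumptions of mine, not part of the approach above. It shows that dropping `\b` finds glued codes but also starts matching inside longer words, and that PCRE lookarounds (with `perl = TRUE`) may be a reasonable middle ground.

```r
currencies <- c("USD", "EUR", "GBP")
alt <- paste(currencies, collapse = "|")

# Hypothetical inputs: a code glued to a number, a normal case, a longer word
y <- c("Revenue 100USD", "Cost in EUR", "EUROPE sales")

# With word boundaries: "100USD" is missed, "EUROPE" is ignored
with_b <- regmatches(y, gregexpr(paste0("\\b(", alt, ")\\b"), y))

# Without word boundaries: "100USD" is found, but so is the "EUR" in "EUROPE"
no_b <- regmatches(y, gregexpr(paste0("(", alt, ")"), y))

# Possible middle ground (assumption): only forbid adjacent letters,
# using lookbehind/lookahead, which need perl = TRUE
look <- paste0("(?<![A-Za-z])(", alt, ")(?![A-Za-z])")
with_look <- regmatches(y, gregexpr(look, y, perl = TRUE))

with_b
no_b
with_look
```

Which variant fits depends on the data: if the names never contain longer words that embed a currency code, simply removing `\b` is enough.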

Wiktor Stribiżew
  • Also, if you need to unlist the results and get empty strings, you might want to have a look at [this demo](http://ideone.com/Yvc1EA). *stringr* functions do that job cleaner though. – Wiktor Stribiżew Dec 13 '16 at 16:46
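
For the flattening mentioned in the comment, one possible base R sketch follows. This is not necessarily what the linked demo does; the name `first_match` and the choice to keep only the first match per element are my assumptions.

```r
x <- c("Total Assets in th USD", "Equity in mil EUR", "Number of Branches")
currencies <- c("USD", "EUR", "GBP")
regex <- paste0("\\b(", paste(currencies, collapse = "|"), ")\\b")

# regmatches() returns a list with one character vector per element of x
matches <- regmatches(x, gregexpr(regex, x))

# Keep the first match of each element, or "" when there is no match
first_match <- vapply(
  matches,
  function(m) if (length(m) > 0) m[1] else "",
  character(1)
)
first_match
# [1] "USD" "EUR" ""
```

`vapply()` with `character(1)` guarantees a plain character vector of the same length as `x`, which matches the expected result in the question.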
We can use str_extract from the stringr package

library(stringr)
str_extract(x, paste(currencies, collapse="|"))
#[1] "USD" "EUR" NA   

Or using sub from base R

v1 <- sub(paste0(".*\\b(", paste(currencies, collapse="|"), ")\\b.*"), "\\1", x)
replace(v1, !v1 %in% currencies, "")
#[1] "USD" "EUR" ""   
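
Note that the `str_extract` call above returns `NA` for non-matches and has no word boundaries, so it would also pick up "EUR" inside a word such as "EUROPE". A hedged sketch that adds the boundaries and converts `NA` to an empty string, assuming stringr is installed (the names `pattern` and `res` are only for illustration):

```r
library(stringr)

x <- c("Total Assets in th USD", "Equity in mil EUR", "Number of Branches")
currencies <- c("USD", "EUR", "GBP")

# Alternation wrapped in word boundaries, so only whole codes match
pattern <- paste0("\\b(", paste(currencies, collapse = "|"), ")\\b")

res <- str_extract(x, pattern)  # first match per element, NA if none
res[is.na(res)] <- ""           # turn NA into the empty string
res
# [1] "USD" "EUR" ""
```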
akrun