1

I have a R data frame that contains a column with the following type of strings:

DBR 0 1/2 02/15/25
FRTR 3 04/25/22
BTPS 1.35 04/15/22

I would like to use a regular expression to match the portion of the string between the first space and the last space.

The output would therefore be:

0 1/2
3
1.35

Some background information:

These are bond descriptions. The first segment of the first line is a country key (DBR = Germany). The last part refers to the maturity date (for the first bond February 15th 2025).

In between the country key and the maturity date, several different conventions are used for the coupon of the bond. For example the German bond has a 0.5% coupon, the second (French) bond has a 3% coupon and the final (Italian) one has a 1.35% coupon.

I already figured out how to match the country key and maturity date via

^[^\\s]+ (for the country key)
[^\\s]+$ (for the maturity date)

Then I want to process the coupons to a uniform format after the match for further calculations.

0 1/2 > 0,5
3 > 3.0
1.35 > 1.35

The mixed formats of the coupons is also why I would like to extract only between the first and last space. the first bond for example has an additional space in the coupon.

Thank you.

RSX
  • 27
  • 7
  • Possible duplicate of [Extract a string of words between two specific words in R](https://stackoverflow.com/questions/28267400/extract-a-string-of-words-between-two-specific-words-in-r) – Rui Barradas Mar 28 '18 at 18:31

2 Answers2

3
sub(".*?\\s+(.*)\\s.*", "\\1", Strings, perl=TRUE)
[1] "0 1/2" "3"     "1.35"

A little detail:

.\*?   matches anything but stops at the first match of what follows
\\s+   matches one or more blank spaces
(.\*)  matches any number of characters, because it is in parentheses
       it becomes a capture group and is stored in the variable \1
\\s    waits for another blank, this time, the last one
.*     matches anything after the last blank
G5W
  • 36,531
  • 10
  • 47
  • 80
  • This works perfectly! Thank you! I only wonder if this PERL = TRUE option is necessary or this is a custom you always use? – RSX Mar 28 '18 at 19:06
0

Here is a complete walk-through in base R:

df <- data.frame(junk = c("DBR 0 1/2 02/15/25", "FRTR 3 04/25/22", "BTPS 1.35 04/15/22"), stringsAsFactors = FALSE)
df$coupon <- sapply(df$junk, function (item) {
  frac <- sub(".*?([\\d./]+)$", "\\1", sub(".*?\\s+(.*)\\s.*", "\\1", item, perl=TRUE), perl = TRUE)
  eval(parse(text = frac))
})
df


This yields
                junk coupon
1 DBR 0 1/2 02/15/25   0.50
2    FRTR 3 04/25/22   3.00
3 BTPS 1.35 04/15/22   1.35

The idea is to apply two regular expressions and eval()ing the result.


Or - using dplyr and some error handling:
library(dplyr)

df <- data_frame(junk = c("DBR 0 1/2 02/15/25", 
                          "FRTR 3 04/25/22", 
                          "BTPS 1.35 04/15/22",
                          "someweirdojunk"))

make_coupon <- function(col) {
  result <- sapply(col, function (item) {
    tryCatch({
      frac <- sub(".*?([\\d./]+)$", "\\1", sub(".*?\\s+(.*)\\s.*", "\\1", item))
      eval(parse(text = frac))
    }, error = function(e) {
      NA
    })
  })
  return(result)
}

df %>%
  mutate(coupon = make_coupon(junk))

This generates:

# A tibble: 4 x 2
  junk               coupon
  <chr>               <dbl>
1 DBR 0 1/2 02/15/25  0.500
2 FRTR 3 04/25/22     3.00 
3 BTPS 1.35 04/15/22  1.35 
4 someweirdojunk      NA    
Jan
  • 42,290
  • 8
  • 54
  • 79