I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement: surround a month abbreviation with spaces if and only if a given condition is fulfilled. As an example, let the condition be "preceded by a digit and followed by a letter".

I tried the stringr package, but I failed to combine the functions str_replace_all() and str_locate_all():

# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"

# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X  JAN END"

# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
#      start end
# [1,]    12  16

# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
Wiktor Stribiżew

2 Answers

You are looking for lookarounds:

(?<=\d)DEC(?=[A-Z])

See a demo on regex101.com.


Lookarounds assert that a position in the string satisfies a condition without consuming any characters. A lookbehind checks what comes immediately before the position, and a lookahead checks what comes immediately after it. Each comes in a positive and a negative form, which gives four types in total (positive/negative lookbehind and lookahead).

A short memo:
  • (?=...) is a positive lookahead
  • (?!...) is a negative lookahead
  • (?<=...) is a positive lookbehind
  • (?<!...) is a negative lookbehind
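To make the four variants concrete, here is a minimal sketch in base R using gsub() with perl = TRUE, which enables PCRE lookarounds. The sample string x and the "<DEC>" marker are assumptions chosen only for illustration; they do not come from the question.

```r
# Illustrative input (an assumption for this sketch): three DEC occurrences
# with different neighbours.
x <- "1DECX 1DEC2 ADECX"

# Positive lookahead: DEC only when a capital letter follows.
pos_ahead <- gsub("DEC(?=[A-Z])", "<DEC>", x, perl = TRUE)
# "1<DEC>X 1DEC2 A<DEC>X"

# Negative lookahead: DEC only when no capital letter follows.
neg_ahead <- gsub("DEC(?![A-Z])", "<DEC>", x, perl = TRUE)
# "1DECX 1<DEC>2 ADECX"

# Positive lookbehind: DEC only when a digit precedes.
pos_behind <- gsub("(?<=\\d)DEC", "<DEC>", x, perl = TRUE)
# "1<DEC>X 1<DEC>2 ADECX"

# Negative lookbehind: DEC only when no digit precedes.
neg_behind <- gsub("(?<!\\d)DEC", "<DEC>", x, perl = TRUE)
# "1DECX 1DEC2 A<DEC>X"

# Both conditions combined, as in the pattern above: only the first DEC
# is preceded by a digit AND followed by a letter.
both <- gsub("(?<=\\d)DEC(?=[A-Z])", " DEC ", x, perl = TRUE)
# "1 DEC X 1DEC2 ADECX"
```

Because the lookarounds consume nothing, the neighbouring digit and letter stay in place and only DEC itself is replaced.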
Jan
  • So `str_replace_all(string = txt, pattern = paste("(?<=\\d)", patt_month, "(?=[A-Z])", sep=''), replacement = " \\1 ")` does it perfectly. Additional escapes added just for R compatibility. Thanks a lot. – Dmitry D. Onishchenko Mar 02 '20 at 12:54
  • @DmitryD.Onishchenko: You're welcome, glad to help. – Jan Mar 02 '20 at 12:57

A Base R version

# Join all month abbreviations with OR: "JAN|FEB|...|DEC"
patt_month <- paste(toupper(month.abb), collapse = "|")
# Three groups: whitespace + digit, month, letter + whitespace
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")
# Same result as above for txt
gsub(pattern = pat, replacement = "\\1 \\2 \\3", x = txt, perl = TRUE)

Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.

[1] "START1SEP2 1 JAN Y JANEND"
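One caveat worth checking: because the pattern includes \\s on both sides, it matches only when the digit is preceded by whitespace and the letter is followed by whitespace, which is narrower than the condition stated in the question. The sketch below compares it with a lookaround-based pattern in base R. The names months_alt, pat_ws, pat_look and the input txt3 are hypothetical and introduced only for this comparison.

```r
# Month alternation built from R's built-in month.abb: "JAN|FEB|...|DEC"
months_alt <- paste(toupper(month.abb), collapse = "|")

# Whitespace-bounded pattern, built the same way as in the answer above.
pat_ws <- paste0("(\\s\\d)(", months_alt, ")([A-Z]\\s)")
# Lookaround pattern: only the month itself is matched and captured.
pat_look <- paste0("(?<=\\d)(", months_alt, ")(?=[A-Z])")

# Input from the question: both patterns agree.
txt <- "START1SEP2 1DECX JANEND"
res_ws   <- gsub(pat_ws, "\\1 \\2 \\3", txt, perl = TRUE)
res_look <- gsub(pat_look, " \\1 ", txt, perl = TRUE)
# both: "START1SEP2 1 DEC X JANEND"

# Hypothetical input without whitespace around the digit and the letter.
txt3 <- "AB12DECXY"
res3_ws   <- gsub(pat_ws, "\\1 \\2 \\3", txt3, perl = TRUE)
# "AB12DECXY"   (unchanged: there is no whitespace for \\s to match)
res3_look <- gsub(pat_look, " \\1 ", txt3, perl = TRUE)
# "AB12 DEC XY"
```

If inputs always have the whitespace shown in the question, either pattern should do; otherwise the lookaround form follows the stated condition more closely.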
GWD