I am deleting all line-break dashes followed by a space ('- ') from a character string in R, except for those preceded by 'en' (this has to do with Dutch grammar). Using this example (gsub with exception in R) I got close to an answer, but I just cannot figure it out completely.

This is an example of a string

string <- "word1 long- er word2, word3 en- word4"

expected result:

"word1 longer word2, word3 en- word4" 
Konrad Rudolph
L Smeets

1 Answer

One option is a negative lookbehind: match a - followed by one or more whitespace characters (\\s+), but only when it is not (!) preceded by the characters 'en'

gsub("(?<!en)(-\\s+)", "", string, perl = TRUE)
#[1] "word1 longer word2, word3 en- word4"

Or use SKIP/FAIL: the first alternative matches 'en-' plus the following whitespace and is then discarded, so only the remaining dashes are removed

gsub("(en-\\s+)(*SKIP)(*FAIL)|-\\s+", "", string, perl = TRUE)
#[1] "word1 longer word2, word3 en- word4"
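Since gsub is vectorized, either pattern can be applied to a whole character vector at once. The following is only a sketch: the helper name `join_hyphenated` and the second test string are made up for illustration and are not part of the original question.

```r
# Hypothetical helper (name is an assumption, not from the question):
# removes "- " line-break hyphens unless they directly follow "en".
join_hyphenated <- function(x) {
  gsub("(?<!en)-\\s+", "", x, perl = TRUE)
}

# Same idea with SKIP/FAIL, kept only to compare the two approaches.
join_hyphenated_skip <- function(x) {
  gsub("en-\\s+(*SKIP)(*FAIL)|-\\s+", "", x, perl = TRUE)
}

strings <- c("word1 long- er word2, word3 en- word4",
             "geen afbre- king hier")  # illustrative second input

res_lookbehind <- join_hyphenated(strings)
res_skip <- join_hyphenated_skip(strings)

print(res_lookbehind)
# Both approaches should agree on these inputs.
print(identical(res_lookbehind, res_skip))
```

Wrapping the pattern in a function keeps the regex in one place, which makes it easier to adjust later if edge cases turn up.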
akrun
  • @LSmeets. I think both should work for similar patterns. There could be edge cases. If this has to be limited to whole words rather than substrings, place a `\\b` before the `en` – akrun Aug 14 '19 at 15:37
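To illustrate the word-boundary suggestion from the comment, here is a small sketch. The input string is an assumption chosen so that a word merely ending in 'en' ('tien') appears next to the standalone word 'en'; it does not come from the question.

```r
# Illustrative input (an assumption): "tien-" ends in "en" but is not the word "en".
x <- "tien- tallen en- word4"

# Without \\b, any dash preceded by the letters "en" is kept,
# so the hyphen in "tien- tallen" survives.
no_boundary <- gsub("(?<!en)-\\s+", "", x, perl = TRUE)

# With \\b, only a dash preceded by the whole word "en" is kept.
with_boundary <- gsub("(?<!\\ben)-\\s+", "", x, perl = TRUE)

# The same restriction applied to the SKIP/FAIL variant.
with_boundary_skip <- gsub("\\ben-\\s+(*SKIP)(*FAIL)|-\\s+", "", x, perl = TRUE)

print(no_boundary)
print(with_boundary)
print(with_boundary_skip)
```

The `\\b` is zero-width, so the lookbehind still has a fixed length of two characters, which PCRE requires.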