I’m doing some text processing for an analysis and have run into a problem: I need to remove “\n” (directly followed by a word) only when it occurs inside parentheses. An example should clarify what I want to do:

“\nMr. Johnson (spoke in \nRussian) (United Kingdom and \nNortheren Ireland) \n” 

I need the “\n” before Mr. Johnson for another task, so I only want to remove “\n” when it is inside parentheses and keep the rest of the text, to get the following output:

“\nMr. Johnson (spoke in Russian) (United Kingdom and Northeren Ireland) \n”

My main idea is to write a regex that captures “\n” inside parentheses and to remove it with the str_replace_all() function from the stringr package. That turned out to be easier said than done. After some research, I have come up with two possible approaches, neither of which works yet:

  1. A plain pattern: “(spoke in (\n).*?)”. However, it selects all the words inside the parentheses, and it cannot handle cases where \n does not appear directly after “spoke in”.
  2. A conditional regex pattern: using https://www.youtube.com/watch?v=k4Be42-sf0s, https://www.regular-expressions.info/conditional.html and “Regular expression with if condition” for inspiration, I tried the pattern "(?(?=((.*?)))\n)", but it does not seem to work…

So I would like to hear whether any of you regex aficionados out there can help me solve this problem. I’m using R with stringr, and therefore the ICU regex engine.

All the best, Erik

1 Answer

You can use

x <- "\nMr. Johnson (spoke in \nRussian) (United Kingdom and \nNortheren Ireland) \n"
library(stringr)
str_replace_all(x, "\\([^()]*\\)", function(z) gsub("\n", "", z, fixed=TRUE) )
# => [1] "\nMr. Johnson (spoke in Russian) (United Kingdom and Northeren Ireland) \n"

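Here, \([^()]*\) matches a substring that starts with ( and ends with ) and has no other ( or ) in between, so only innermost, non-nested parentheses are matched. The replacement function, function(z) gsub("\n", "", z, fixed=TRUE), then removes all line feed characters from each non-overlapping match.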
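To make the two steps easier to see, here is a minimal sketch that first lists what the pattern matches and then wraps the replacement in a helper. The name strip_newlines_in_parens is made up for this illustration and is not part of stringr. It assumes the parentheses are balanced and not nested.

```r
library(stringr)

x <- "\nMr. Johnson (spoke in \nRussian) (United Kingdom and \nNortheren Ireland) \n"

# Step 1: inspect the matches. Each parenthesised group is returned whole,
# including the line feed inside it.
str_extract_all(x, "\\([^()]*\\)")[[1]]
# => [1] "(spoke in \nRussian)"  "(United Kingdom and \nNortheren Ireland)"

# Step 2: hypothetical helper (the name is an assumption for this sketch).
# The function passed as replacement receives the matched text and returns
# it with every "\n" removed; text outside parentheses is left untouched.
strip_newlines_in_parens <- function(txt) {
  str_replace_all(
    txt,
    "\\([^()]*\\)",
    function(m) gsub("\n", "", m, fixed = TRUE)
  )
}

strip_newlines_in_parens(x)
# => [1] "\nMr. Johnson (spoke in Russian) (United Kingdom and Northeren Ireland) \n"
```

Because gsub() is used inside the helper, a group containing several line feeds, such as "(a\nb\nc)", should come back as "(abc)".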

A base R approach is also possible:

x <- "\nMr. Johnson (spoke in \nRussian) (United Kingdom and \nNortheren Ireland) \n"
gr <- gregexpr("\\([^()]*\\)", x)
mat <- regmatches(x, gr)
regmatches(x, gr) <- lapply(mat, function(z) gsub("\n", "", z, fixed=TRUE))
x
# => [1] "\nMr. Johnson (spoke in Russian) (United Kingdom and Northeren Ireland) \n"

See this R demo online.
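If the base R route is needed in more than one place, it could be wrapped in a small function. This is only a sketch: the name strip_newlines_in_parens_base is hypothetical, and it assumes balanced, non-nested parentheses. It uses gsub() rather than sub() inside the replacement so that every line feed within a group is removed, not just the first one.

```r
# Hypothetical helper built on the base R approach (no packages required).
strip_newlines_in_parens_base <- function(txt) {
  # Locate every "(...)" group that contains no further parentheses.
  gr <- gregexpr("\\([^()]*\\)", txt)
  # Replace each matched group with a copy that has all "\n" removed.
  regmatches(txt, gr) <- lapply(
    regmatches(txt, gr),
    function(m) gsub("\n", "", m, fixed = TRUE)
  )
  txt
}

strip_newlines_in_parens_base("a (b\nc\nd) \ne")
# => [1] "a (bcd) \ne"
```

The function is vectorised over txt, and elements without any parentheses are returned unchanged.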

Wiktor Stribiżew