I have text I am cleaning up in R. I want to use stringi, but am happy to use other packages.

Some of the words are broken over two lines. So I get a sub-string "halfword-\nsecondhalfword".

I also have strings that are just "----\nword" and " -\n" (and some others) that I do not want to replace.

What I want to do is identify all sub-strings "[a-z]-\n" and then keep the generic letter [a-z], but remove the -\n characters.

I do not want to remove every -\n, and I do not want to remove the letter [a-z].

Thanks!

  • Have you tried word boundaries yet? – Wiktor Stribiżew Apr 19 '19 at 16:30
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Make it easy for us to copy/paste the tests rather than extract them from your text. – MrFlick Apr 19 '19 at 16:42
  • Perhaps you just want `gsub("([a-z])-\n", "\\1", x)`? – MrFlick Apr 19 '19 at 16:44

1 Answer

You can use word boundaries to match -<LF> only between word characters:

gsub("\\b-\n\\b", "", x)
gsub("(*UCP)\\b-\n\\b", "", x, perl=TRUE)
stringr::str_replace_all(x, "\\b-\n\\b", "")

The latter two support word boundaries between any Unicode word characters.
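
As a quick check of the word-boundary approach, here is a minimal sketch in base R. The vector `x` is hypothetical sample input assembled from the strings described in the question, and `perl = TRUE` is used here only to pin down the regex engine:

```r
# Hypothetical sample input based on the strings described in the question.
x <- c("halfword-\nsecondhalfword", "----\nword", " -\n")

# Remove "-\n" only where a word character sits directly on both sides.
joined <- gsub("\\b-\n\\b", "", x, perl = TRUE)

# Only the first element changes; the other two have no word character
# immediately before the hyphen, so they are left alone.
print(joined)
```

Note that `\b` treats digits and underscores as word characters too, so a string such as "10-\n20" would also be joined by this pattern.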

See the regex demo.

If you only want to remove -<LF> between letters, you may use

gsub("([a-zA-Z])-\n([a-zA-Z])", "\\1\\2", x)
gsub("(\\p{L})-\n(\\p{L})", "\\1\\2", x, perl=TRUE)
stringr::str_replace_all(x, "(\\p{L})-\n(\\p{L})", "\\1\\2")

If you only need to support lowercase letters, remove A-Z from both character classes in the first gsub and replace \p{L} with \p{Ll} in the latter two.
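
To illustrate how the lowercase-only variant differs from the word-boundary one, the sketch below runs both on the same input. The vector `x` and the names `lower_only` and `word_bound` are made up for this illustration:

```r
# Hypothetical sample input: lowercase, uppercase, digits, and a hyphen run.
x <- c("halfword-\nsecondhalfword", "ABC-\nDEF", "10-\n20", "----\nword")

# Lowercase-only variant: join only when a lowercase letter is on each side.
lower_only <- gsub("(\\p{Ll})-\n(\\p{Ll})", "\\1\\2", x, perl = TRUE)

# Word-boundary variant for comparison: also joins uppercase and digits.
word_bound <- gsub("\\b-\n\\b", "", x, perl = TRUE)

print(lower_only)
print(word_bound)
```

The lowercase-only pattern changes just the first element, while the word-boundary pattern also joins "ABC-\nDEF" and "10-\n20". Neither touches "----\nword".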

See this regex demo.

Wiktor Stribiżew