I am currently taking a course that teaches textual analysis in R. As I am fairly new to R, I have not yet figured out how to cut everything that comes after a specific set of characters in a string.

For example, given the following:

documentName <- "Hello my name is Johann my had is the largest to be deleted X"

My desired outcome is:

documentName <- "Hello my name is Johann"

So far I have tried the following, but it is not getting me anywhere:

gsub("(\Johann).*\\","",documentName)

Any hint would be much appreciated.

Martin Gal
Johann

2 Answers

Here is one way, capturing all content appearing before Johann:

x <- "Hello my name is Johann my had is the largest to be deleted"
out <- sub("^(.*\\bJohann)\\b.*$", "\\1", x)
out

[1] "Hello my name is Johann"

Another approach, stripping off all content appearing after Johann:

sub("(?<=\\bJohann)\\s+.*$", "", x, perl=TRUE)
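
One difference between the two patterns is worth a quick sketch: when the keyword occurs more than once, the greedy capture keeps everything up to the last occurrence, while the lookbehind version cuts after the first. The input `x3` below is a made-up string used only for illustration, and `perl = TRUE` is set on every call (an assumption on my part) so that all three patterns run on the same regex engine.

```r
# Hypothetical input with two occurrences of the keyword
x3 <- "Johann met Johann today"

# Greedy capture: .* grabs as much as possible, so the cut lands after the LAST "Johann"
last_cut <- sub("^(.*\\bJohann)\\b.*$", "\\1", x3, perl = TRUE)

# Lookbehind: sub() replaces the first match, so the cut lands after the FIRST "Johann"
first_cut <- sub("(?<=\\bJohann)\\s+.*$", "", x3, perl = TRUE)

# Lazy capture (.*?) also stops at the FIRST "Johann"
lazy_cut <- sub("^(.*?\\bJohann)\\b.*$", "\\1", x3, perl = TRUE)

print(last_cut)   # "Johann met Johann"
print(first_cut)  # "Johann"
print(lazy_cut)   # "Johann"
```

For the string in the question, which contains the keyword only once, all three variants return the same result.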
Tim Biegeleisen

You could use str_remove() from the stringr package:

library(stringr)
str_remove(documentName, "(?<=Johann).*")
[1] "Hello my name is Johann"

or adjust your gsub() regex to

gsub("(?<=Johann).*", "", documentName, perl=TRUE)
[1] "Hello my name is Johann"
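
If the keyword is a literal string rather than a pattern, a regex is not strictly needed. The following is only a sketch of that alternative using base R's regexpr() with fixed = TRUE and substr(); the helper name `cut_after` and the second test string are hypothetical and not part of the original question.

```r
# Hypothetical helper: keep everything up to and including the first
# literal occurrence of `key`; strings without `key` are returned unchanged.
cut_after <- function(text, key) {
  # Start position of the first literal match, -1 where there is none
  pos <- as.integer(regexpr(key, text, fixed = TRUE))
  found <- !is.na(pos) & pos > 0
  # Truncate only the elements that actually contain the keyword
  text[found] <- substr(text[found], 1, pos[found] + nchar(key) - 1)
  text
}

docs <- c("Hello my name is Johann my had is the largest to be deleted X",
          "No keyword here")
result <- cut_after(docs, "Johann")
print(result)
```

Because the match is literal, characters such as `.` or `(` in the keyword need no escaping, which may be convenient when the keyword comes from user input.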
Martin Gal