I have a character vector, Data, as shown below:

Data
Posted by Mohit Garg on May 7, 2016
Posted by Dr. Lokesh Garg on April 8, 2018
Posted by Lokesh.G.S  on June 11, 2001
Posted by Mohit.G.S. on July 23, 2005
Posted by Dr.Mohit G Kumar Saha on August 2, 2019

I used str_extract() as follows:

str_extract(Data, "Posted by \\w+. \\w+ \\w+")

It produced this output:

[1] "Posted by Mohit Garg on"   "Posted by Dr. Lokesh Garg" NA                         
[4] NA                          NA  

I want the output to look like this:

[1] "Posted by Mohit Garg on"   "Posted by Dr. Lokesh Garg"  "Posted by Lokesh.G.S"                       
[4] "Posted by Mohit.G.S."                     "Posted by Dr.Mohit G Kumar Saha"
djMohit

2 Answers

You can try:

stringr::str_extract(df$Data, "Posted by .+?(?=\\s+on)")

#[1] "Posted by Mohit Garg" "Posted by Dr. Lokesh Garg"  "Posted by Lokesh.G.S"
#[4] "Posted by Mohit.G.S." "Posted by Dr.Mohit G Kumar Saha"

This extracts everything from "Posted by" up to, but not including, the first run of whitespace followed by "on".
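
To see the lazy quantifier and the lookahead at work without stringr, here is a minimal base R sketch using regexpr() and regmatches() with perl = TRUE. The vector x and the other variable names are illustrative only and not part of the original data.

```r
# Illustrative input (a subset of the question's data)
x <- c("Posted by Mohit Garg on May 7, 2016",
       "Posted by Lokesh.G.S  on June 11, 2001")

# .+? grows one character at a time; the lookahead (?=\\s+on) stops it at the
# first run of whitespace followed by "on", without consuming that text.
m <- regexpr("Posted by .+?(?=\\s+on)", x, perl = TRUE)
res <- regmatches(x, m)
print(res)
```

Because the match stops at the first whitespace followed by "on", a name that contains a word starting with "on" would be cut short. Note also that regmatches() drops elements with no match, whereas str_extract() returns NA for them.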


The same in base R:

sub(".*(Posted by .+?)(?=\\s+on).*", '\\1', df$Data, perl = TRUE) 
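
One difference worth keeping in mind: sub() returns the input unchanged when the pattern does not match, while str_extract() returns NA. The sketch below, which assumes a made-up second element with no "Posted by" text, shows one way to get NA for non-matching elements in base R.

```r
# Illustrative input; the second element is a hypothetical non-matching string
x <- c("Posted by Mohit Garg on May 7, 2016", "No author line here")

# sub() leaves non-matching elements untouched
out <- sub(".*(Posted by .+?)(?=\\s+on).*", "\\1", x, perl = TRUE)

# Flag the elements that actually match and blank out the rest
ok <- grepl("Posted by .+?(?=\\s+on)", x, perl = TRUE)
out[!ok] <- NA_character_
print(out)
```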

data

df <- structure(list(Data = c("Posted by Mohit Garg on May 7, 2016", 
"Posted by Dr. Lokesh Garg on April 8, 2018", "Posted by Lokesh.G.S  on June 11, 2001", 
"Posted by Mohit.G.S. on July 23, 2005", "Posted by Dr.Mohit G Kumar Saha on August 2, 2019"
)), class = "data.frame", row.names = c(NA, -5L))
Ronak Shah

You can use sub() to remove " on" and everything after it with the pattern " +?on.*$".

sub(" +?on.*$", "", Data)
#[1] "Posted by momon"                 "Posted by on Mohit Garg"        
#[3] "Posted by Dr. Lokesh Garg"       "Posted by Lokesh.G.S"           
#[5] "Posted by Mohit.G.S."            "Posted by Dr.Mohit G Kumar Saha"

Or, with perl = TRUE, capture everything before the last " on" (the \\S keeps trailing spaces out of the result):

sub("(.*\\S) +on.*", "\\1", Data, perl = TRUE)
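
The greedy capture can be checked on two of the inputs. This is a small sketch comparing a plain (.*) group with a (.*\\S) group. The variable names greedy and trimmed are illustrative.

```r
# Two inputs from the data: one with "on" early in the string,
# one with two spaces before the final "on"
Data <- c("Posted by on Mohit Garg on May 7, 2016",
          "Posted by Lokesh.G.S  on June 11, 2001")

# Greedy (.*) backtracks from the end, so it stops at the LAST " on";
# with two spaces before "on", one of them stays inside the capture group.
greedy <- sub("(.*) +on.*", "\\1", Data, perl = TRUE)

# Requiring a non-space character at the end of the group avoids that.
trimmed <- sub("(.*\\S) +on.*", "\\1", Data, perl = TRUE)

print(greedy)
print(trimmed)
```

With (.*) alone, the second result keeps a trailing space. Ending the group with \\S is one way around that, and calling trimws() on the result would be another.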

Data:

Data <- c("Posted by momon on Monday 29 Feb 2020"
, "Posted by on Mohit Garg on May 7, 2016"
, "Posted by Dr. Lokesh Garg on April 8, 2018"
, "Posted by Lokesh.G.S  on June 11, 2001"
, "Posted by Mohit.G.S. on July 23, 2005"
, "Posted by Dr.Mohit G Kumar Saha on August 2, 2019")

Also have a look at the question "R regex compiler working differently for the given regex".

GKi
  • This will delete "on" from the name if the name contains "on". I want to remove "on" from the end only. – djMohit May 26 '20 at 07:14
  • Now it should take the last on. – GKi May 26 '20 at 07:17
  • Using `sub(" +?on.*$", "", Data)` on `Data <- c("Posted by Ona'je Monday on 29 Feb 2020", "Posted by Ondrej on 29 Feb 2020")` gives me `"Posted by Ona'je Monday" "Posted by Ondrej"`, which is what I would have expected when removing everything after the last `on`. – GKi May 26 '20 at 07:44
  • Using `sub(" +?on.*$", "", Data)` on `Data <- c("Posted by ondrej on 29 Feb 2020")` gives me `"Posted by ondrej"`, which is what I would have expected when removing everything after the last `on`. – GKi May 26 '20 at 07:52