I have a large collection of text read into a data frame with a single text column. Some of its sentences mention a time in different formats, as shown below:

Row 1. I tried to call you on xxx-xxx-xxxx, however reached voice mail I'm scheduling our next follow up on 6/13/2018 between 12 PM and 2 PM PST.

Row 2. I will call you again today if I hear something from them, if not, will call you tomorrow between 4 - 6PM EST.

Row 3. We will await for your reply, if we don't hear from you then we will call you tomorrow between 12:00PM to 2:00PM CST

Row 4. As discussed over the call, we scheduled call back for tomorrow between 12 - 02 PM EST.

Row 5. As suggested by you, we will have our next follow up on 6/13/2018 between 12 PM TO 2 PM PST.

I would like to extract just the time part along with the time zone (EST/CST/PST).

Expected Outputs:

6/13/2018 4 PM - 6 PM EST
tomorrow 12 PM TO 2 PM PST

I have tried the following:

library(stringr)
x <- text$string

sc1 <- str_match(x, " follow up on (.*?) T.")

which returns something like:

follow up on 6/13/2018 between 1 PM TO | 6/13/2018 between 1 PM

I tried to cover the other formats with a second pattern:

sc2 <- str_match(x, " will call you tomorrow between (.*?) T.")

and then combined both formats (follow up * and will call you *) with rbind:

sc1rb <- rbind(sc1,sc2)

which did not work.

Is there a way to extract only the time part, along with the time zone, from the example strings above?

Thanks in advance!

Mr Rj
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please share data in a format that is easy to copy/paste into R. Also, make sure your test cases cover everything you are interested in matching as we can only test solutions using data you provide. Trying to match "free text" like this can be very tricky. – MrFlick Jun 13 '18 at 20:23
  • Is this the expected output for all four strings? – Onyambu Jun 13 '18 at 22:47
  • Yes @Onyambu, this is the expected output for all four strings. – Mr Rj Jun 14 '18 at 15:59

3 Answers

Here's something that works for the sample. As @MrFlick mentioned, please try to share your data in a reproducible way.

Data

> dput(txt)
c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.", 
"will call you tomorrow between 4 - 6PM EST.", "will call you tomorrow between 12:00PM to 2:00PM CST", 
"will call you tomorrow between 11 AM to 12 PM EST", "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST."
)

Code

> regmatches(txt, regexec('[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})', txt))
[[1]]
[1] " 12 PM and 2 PM PST" "12 PM and 2 PM PST" 

[[2]]
[1] " 4 - 6PM EST" "4 - 6PM EST" 

[[3]]
character(0)

[[4]]
[1] " 11 AM to 12 PM EST" "11 AM to 12 PM EST" 

[[5]]
[1] " 12 PM TO 2 PM PST" "12 PM TO 2 PM PST"

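The output is a list in which each element is a character vector holding the full match followed by the captured group (see the help page for regmatches). The third element is empty because this pattern requires whitespace after the leading digits, so it does not match "12:00PM". Allowing a colon as well and keeping only the captured group gives the time part for every string: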

> unname(sapply(txt, function(z){
   pattern <- '[[:space:]]([[:digit:]]{1,2}([[:space:]]|:).*[[:upper:]]{3})'
   k <- unlist(regmatches(z, regexec(pattern = pattern, z)))
   return(k[2])
 }))
[1] "12 PM and 2 PM PST"    "4 - 6PM EST"           "12:00PM to 2:00PM CST" "11 AM to 12 PM EST"   
[5] "12 PM TO 2 PM PST" 

This is based on the sample input. If the real input is much more irregular, a single regex will be hard to get right. In that case, I'd recommend using several regex functions called one after the other, each tried only when the preceding ones return NA. Hope this is helpful!
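The fallback idea can be sketched roughly as follows. This is only an illustration under assumptions: `extract_first` is a hypothetical helper (not part of any package), and the two patterns are simply the space and colon variants of the regex above, tried in that order.

```r
# Hypothetical helper: try each pattern in turn and keep the first capture
# group that matches; elements that never match stay NA.
extract_first <- function(x, patterns) {
  out <- rep(NA_character_, length(x))
  for (p in patterns) {
    todo <- is.na(out)
    if (!any(todo)) break
    m <- regmatches(x[todo], regexec(p, x[todo]))
    # k[1] is the full match, k[2] the first capture group
    out[todo] <- vapply(
      m,
      function(k) if (length(k) >= 2) k[2] else NA_character_,
      character(1)
    )
  }
  out
}

# Assumed pattern order: digits followed by a space, then digits followed by a colon
patterns <- c(
  "[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})",
  "[[:space:]]([[:digit:]]{1,2}:.*[[:upper:]]{3})"
)

txt <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
         "will call you tomorrow between 12:00PM to 2:00PM CST",
         "no time mentioned here")

res <- extract_first(txt, patterns)
print(res)
```

Rows that no pattern matches come back as NA, which makes it easy to see which formats still need a pattern of their own.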

Gautam

This code works for almost all of your cases, except for the substring "4 - 6PM EST". I hope it is useful on your whole data set.

  data <- c(
    "Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
    "will call you tomorrow between 4 - 6PM EST.",
    "will call you tomorrow between 12:00PM to 2:00PM CST",
    "will call you tomorrow between 11 AM to 12 PM EST",
    "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST."
  )

  # remove dates such as 6/13/2018
  data <- gsub("\\d{1,2}/\\d{1,2}/\\d{4}", "", data)

  # texts to exclude, and substitutions to apply (input -> output)
  excluded_texts <- c("Next follow up on", "between", "will call you tomorrow", ":00", "\\.")
  replaced_input <- c("  ", "'-", "and", "TO", " AM", " PM")
  replaced_output <- c("", "to", "to", "to", "AM", "PM")

  # drop each excluded text in turn
  for (i in excluded_texts) {
    data <- gsub(i, "", data)
  }

  # apply the substitutions pairwise
  for (j in seq_along(replaced_input)) {
    data <- gsub(replaced_input[j], replaced_output[j], data)
  }

  print(data)
Amara BOUDIB
sub(".*?(\\d+\\s*[PA:-].*)","\\1",data)
[1] "12 PM and 2 PM PST."   "4 - 6PM EST."          "12:00PM to 2:00PM CST"
[4] "11 AM to 12 PM EST"    "12 PM TO 2 PM PST." 
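If the day is needed as well, as in the question's expected output, one possible sketch is to capture it together with the time range. The pattern below is an assumption based only on the sample strings: it expects the day (a date such as 6/13/2018, or "today"/"tomorrow") to be followed directly by the word "between", and the time part to end in a three-letter upper-case time zone.

```r
data <- c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
          "will call you tomorrow between 4 - 6PM EST.")

# Group 1: the day; group 2: everything after "between" up to the time zone
pattern <- paste0(
  "([0-9]{1,2}/[0-9]{1,2}/[0-9]{4}|today|tomorrow)",
  "[[:space:]]+between[[:space:]]+",
  "(.*[[:upper:]]{3})"
)

m <- regmatches(data, regexec(pattern, data))

# Paste day and time together; strings without a match become NA
day_time <- vapply(
  m,
  function(k) if (length(k) == 3) paste(k[2], k[3]) else NA_character_,
  character(1)
)
print(day_time)
```

Because the time zone closes the second group, the trailing period is dropped as a side effect.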
Onyambu