
I am trying to find time patterns that consist of a number followed by some form of am or pm, and I want to replace the whole pattern with --.

My idea was to find the string am or pm, which may or may not contain a dot . before, between, or after the letters, and then match it together with any number pattern in front of it, back to the preceding white space.

Here is the original data t0:

t0 <- c("29th October 2022 5-6pm", "12-1pm 02/11/22", "10:25 bike rack at bexley college erith", "November 2nd 2022, apm shop ", " between 7pm Thursday 27th October to Saturday 29th October 9am", "04/09/2022 at 4 a.m.", "4/09/2022 at 4.a.m.", "04/09/2022 at 4.a.m" , "28.10.22 between 1.30pm and midnight", " Sunday 30th October 2022 between 11am and 3pm", "30th October, approx 6pm", "03/11/2022", "02/11/22 at campus", "Between 15:15 and 21:10", "03/11/2022 7pm", " Between 5:30pm and 6:30pm on 31/10/2022", "10am-2pm 31 oct 2022", "31/10/22 5.15am", " Tuesday 25th October 2022. 10:30pm", "30/10/2022 6pm")

I then created two variables, t1 and t2, to store the search result and the gsub result. This is what I get:

library("stringr")

t1 <- t0[str_detect(t0, "\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?")]
t2 <- t1 %>% gsub("\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?","--", .)

> t1
 [1] "29th October 2022 5-6pm"                                         "12-1pm 02/11/22"                                                
 [3] "November 2nd 2022, apm shop "                                    " between 7pm Thursday 27th October to Saturday 29th October 9am"
 [5] "04/09/2022 at 4 a.m."                                            "4/09/2022 at 4.a.m."                                            
 [7] "04/09/2022 at 4.a.m"                                             "28.10.22 between 1.30pm and midnight"                           
 [9] " Sunday 30th October 2022 between 11am and 3pm"                  "30th October, approx 6pm"                                       
[11] "03/11/2022 7pm"                                                  " Between 5:30pm and 6:30pm on 31/10/2022"                       
[13] "10am-2pm 31 oct 2022"                                            "31/10/22 5.15am"                                                
[15] " Tuesday 25th October 2022. 10:30pm"                             "30/10/2022 6pm"   

> t2
 [1] "29th October 2022 5-6--"                                       "12-1-- 02/11/22"                                              
 [3] "November 2nd 2022, a-- shop "                                  " between 7-- Thursday 27th October to Saturday 29th October--"
 [5] "04/09/2022 at 4 a.m."                                          "4/09/2022 at--"                                               
 [7] "04/09/2022 at--"                                               "28.10.22 between 1.30-- and midnight"                         
 [9] " Sunday 30th October 2022 between-- and 3--"                   "30th October, approx 6--"                                     
[11] "03/11/2022 7--"                                                " Between 5:30-- and 6:30-- on 31/10/2022"                     
[13] "10am-2-- 31 oct 2022"                                          "31/10/22--"                                                   
[15] " Tuesday 25th October 2022. 10:30--"                           "30/10/2022 6--"   

While the desired result is:

> t2
[1] "29th October 2022--"                                              "-- 02/11/22"                                              
[3] " between-- Thursday 27th October to Saturday 29th October--"      "04/09/2022 at--"
[5] "4/09/2022 at--"                                                   "04/09/2022 at--"                                               
[7] "28.10.22 between-- and midnight"                                  " Sunday 30th October 2022 between-- and--"                   
[9] "30th October, approx--"                                           "03/11/2022--"                                                
[11] " Between-- and-- on 31/10/2022"                                  "----- 31 oct 2022"                                          
[13] "31/10/22--"                                                      " Tuesday 25th October 2022.--"                           
[15] "30/10/2022--"   

How should I correct the regex pattern?

sam

1 Answer

t1 <- gsub("\\s?[-:0-9.]+\\s*[ap]\\.?m\\.?", "--", t0)
t1[t1 != t0]
#  [1] "29th October 2022--"                                        
#  [2] "-- 02/11/22"                                                
#  [3] " between-- Thursday 27th October to Saturday 29th October--"
#  [4] "04/09/2022 at--"                                            
#  [5] "4/09/2022 at--"                                             
#  [6] "04/09/2022 at--"                                            
#  [7] "28.10.22 between-- and midnight"                            
#  [8] " Sunday 30th October 2022 between-- and--"                  
#  [9] "30th October, approx--"                                     
# [10] "03/11/2022--"                                               
# [11] " Between-- and-- on 31/10/2022"                             
# [12] "---- 31 oct 2022"                                           
# [13] "31/10/22--"                                                 
# [14] " Tuesday 25th October 2022.--"                              
# [15] "30/10/2022--"                                               

The only difference between that and your professed "desired result" is in [12]:

t1[t1 != t0][12]
# [1] "---- 31 oct 2022"
t2[12]
# [1] "----- 31 oct 2022"
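As a rough sketch of why the two patterns behave differently (the names `x`, `old`, `new`, `old_res` and `new_res` are only for this illustration): in the question's pattern the top-level `|` splits the whole expression into two branches, so a bare `pm` matches without any digits in front of it, and the `|` characters inside `[...]` are literal characters rather than alternation. Using `[ap]` keeps the am/pm choice inside a single branch, and adding `-` to the character class lets ranges such as `12-1pm` match as a whole.

```r
# Three strings taken from t0 in the question
x <- c("03/11/2022 7pm", "November 2nd 2022, apm shop ", "12-1pm 02/11/22")

# Question's pattern: reads as  <space + digits + am>  OR  <pm on its own>,
# so only the letters "pm" are replaced and the digits stay behind.
old <- "\\s[\\s|0-9|\\.|:]+a\\.?m\\.?|p\\.?m\\.?"
old_res <- gsub(old, "--", x)
old_res
# [1] "03/11/2022 7--"               "November 2nd 2022, a-- shop "
# [3] "12-1-- 02/11/22"

# Answer's pattern: optional space, then digits/punctuation (including "-"),
# optional spaces, then "a" or "p", then "m", with optional dots.
new <- "\\s?[-:0-9.]+\\s*[ap]\\.?m\\.?"
new_res <- gsub(new, "--", x)
new_res
# [1] "03/11/2022--"                 "November 2nd 2022, apm shop "
# [3] "-- 02/11/22"
```

The same reasoning accounts for `[12]`: in `"10am-2pm 31 oct 2022"`, `10am` and `-2pm` are two separate matches, each replaced by `--`, which gives four dashes rather than five.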
r2evans
  • thank you very much! your result is actually better than my desired result! what's the meaning of the `\\1` in the `gsub` function? – sam Nov 06 '22 at 23:02
  • oops ... it's a mistake here ... not needed here (holdover from some previous testing on the data) ... editing out now – r2evans Nov 06 '22 at 23:54
  • no problem! but I remember seeing `\\1` in another post, where it meant something like referring to group 1 of the regex search; I just cannot recall exactly what it means and was not able to google the explanation. Could you explain in brief what it actually does when it is used? thank you very much – sam Nov 07 '22 at 09:59
  • 1
    Something like `gsub(".*_([^_]+)_.*", "\\1", vec)` will keep the middle of a `_`-separated triplet; the `"\\1"` refers to the first (in this one, only) use of `(...)` parenthetic groups in the pattern. https://stackoverflow.com/a/22944075/3358272 might be useful here for a good reference, noting that R requires double-backslash for escapes, not singles. – r2evans Nov 07 '22 at 11:39
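To make the last comment concrete, here is a minimal sketch of backreferences in a replacement string. The vector `vec` and the result names are made up for the example; they are not from the thread.

```r
# Hypothetical input: "_"-separated triplets
vec <- c("a_b_c", "left_middle_right")

# (...) captures the middle piece; "\\1" in the replacement inserts
# whatever that first group matched, so only the middle part is kept.
middle <- gsub(".*_([^_]+)_.*", "\\1", vec)
middle
# [1] "b"      "middle"

# With two groups, "\\1" and "\\2" can be reordered in the replacement.
swapped <- sub("([0-9]+)([ap]m)", "\\2:\\1", "7pm")
swapped
# [1] "pm:7"
```

If the pattern contains no `(...)` group, as in the answer's final regex, there is nothing for `\\1` to refer to, which is why it was not needed there.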