0

How can I use grep to find a partial string match and return only a portion of that string to a new column in my existing data table?
For example, I have a column $remarks in dt:"FC3"
Some of the remarks say "blah blah blah 57 DAYS LATE blah blah"
Is there a statement I can use to grab the '57 DAYS LATE' part and put it in a new column? Of course, it is not always 57 days, sometimes its 145, sometimes just 8 - so the length of the string is dynamic.

per request: here is example/reproducable data (i think this is what you were asking for)

7548    1D10000 2016    2016 CAT 1 WAS SUBMITTED 9 DAYS LATE  
3647    1D10001 2011    PENALTY PAID   
3547    1D39949 2013    2013 CAT 1 WAS 57 DAYS LATE SUBMIT  

here, the column containing the strings "2016 CAT 1 WAS SUBITTED 9 DAYS LATE", as well as the string "2013 CAT 1 WAS 57 DAYS LATE SUBMIT" in their respective rows are the strings I am referring to. What process would be best for me to be able to search for, grab, and place the "9 DAYS LATE" or the "57 DAYS LATE" strings in a new column?

The column name containing the string i want is FC4$remarks

Thank you and I hope this clarifies!

oguz ismail
  • 1
  • 16
  • 47
  • 69
few fda
  • 1
  • 2
  • Not `grep`, but there are other regular expression functions that can probably do what you are asking. Please provide a reproducible example for people to look at and test. – lmo Mar 29 '17 at 18:43
  • Thanks Imo - I added some rows from the actual d.table if that helps. Someone gave me a great starter code, but it leaves part of the string that I dont need. – few fda Mar 29 '17 at 19:21
  • also, Frank, thank you so much for providing the documentation so I can reproduce my table results neatly like you did. I apologize for my inexperience, but I will learn! – few fda Mar 29 '17 at 19:23
  • Fyi, you can @name people to ping them. Your table still isn't *easily* reproducible by others, fyi. Have a look at http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 The edit I made was for formatting: you can select code and press CTRL+K or indent by four spaces to display code blocks. – Frank Mar 29 '17 at 19:33

1 Answers1

0

you can use gsub() and a capture expression, like this:

dt <- data.table(remarks = c("blah blah blah 57 DAYS LATE blah blah",
                             "blah blah blah 145 DAYS LATE blah blah",
                             "2013 CAT 1 WAS 123 DAYS LATE SUBMIT",
                             "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE"))

dt$new_column <- gsub(".* (\\d+ DAYS LATE).*", "\\1", dt$remarks)
# captures one or more consecutive digits and the string " DAYS LATE"

dt
                                  remarks    new_column
1:  blah blah blah 57 DAYS LATE blah blah  57 DAYS LATE
2: blah blah blah 145 DAYS LATE blah blah 145 DAYS LATE
3:    2013 CAT 1 WAS 123 DAYS LATE SUBMIT 123 DAYS LATE
4:   2016 CAT 1 WAS SUBMITTED 9 DAYS LATE   9 DAYS LATE

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.9.6 magrittr_1.5     dplyr_0.5.0      purrr_0.2.2      readr_1.0.0      tidyr_0.6.1      tibble_1.2      
 [8] tidyverse_1.0.0  ggmap_2.7        ggplot2_2.2.1 
Nate
  • 10,361
  • 3
  • 33
  • 40
  • Thanks Nate, this gets me started. However, my only issue is that in some of my rows, there are multiple instances of numbers. (eg: 2013 CAT 1 WAS 57 DAYS LATE SUBMIT). your code results in the replication part of the string as follows "2013 CAT1 WAS 57 DAYS LATE". I am solely interested in the "57 DAYS LATE" or even "57 DAYS" part. I plan then, to strip the DAYS LATE part and perform math on the number that remains. Thank you so much! – few fda Mar 29 '17 at 19:20
  • This minor modification may be a bit closer to the end result: `gsub(".* (\\d+) DAYS LATE.*", "\\1", "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE")`. – lmo Mar 29 '17 at 19:29
  • @Imo I tried that, but that just replaces everything in the last column with "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE". If possible I just want it to return "9 DAYS LATE" – few fda Mar 29 '17 at 20:10
  • @few fda Hmm, I'm not sure what is causing your issues, I tested your "special" scenarios and seems to work fine for me. – Nate Mar 30 '17 at 03:19