0

Background: I have a dataset in R Studio containing around 1 million rows and 10 different columns. Name of dataframe = Pontus_DF I am interested in one particular column, called "Notes".

Desired action and output: I want to go through the column "Notes" for all the rows and identify the pattern "MIC?:" and then extract the word before MIC?: + the pattern itself + the word after MIC?: and put in a new column called "New_Notes". The problem is that the number of columns that I want to generate per row is not fixed, but depends on how many times the pattern "MIC?:" is found in the column "Notes" for that particular row.

Hope the desired output was clear and I'm happy to receive any comments about how to further clarify my problem.

All the best, Pontus

phiver
  • 23,048
  • 14
  • 44
  • 56
Pontus Hedberg
  • 301
  • 1
  • 2
  • 9

1 Answers1

0

EDIT:

Does this answer your question?

Sample data:

Pontus_DF <- data.frame(
  Notes = c("some words before MIC?: some words thereafter but before another MIC?: and so on",
            "here 's another MIC?: and one more MIC?: and why not yet another MIC?: etc.",
            "this too MIC?: and that MIC?: until the end")
)

Using str_extract_alland lookaround you can extract the words in question:

library(stringr)
Pontus_DF$Notes_new <- lapply(str_extract_all(Pontus_DF$Notes, "(?<=MIC\\?:\\s)\\b\\w+\\b|\\b\\w+\\b(?=\\sMIC\\?:)"), paste0, collapse = ",")

Result:

Pontus_DF
                                                                             Notes
1 some words before MIC?: some words thereafter but before another MIC?: and so on
2      here 's another MIC?: and one more MIC?: and why not yet another MIC?: etc.
3                                      this too MIC?: and that MIC?: until the end
                         Notes_new
1          before,some,another,and
2 another,and,more,and,another,etc
3               too,and,that,until 

                                                                   
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • Pontus_DF$Notes_before <- unlist(str_extract_all(Pontus_DF$Notes, "\\b\\w+\\b\\s(?=MIC\\?:)")) Pontus_DF$Notes_after <- unlist(str_extract_all(Pontus_DF$Notes, "(?<=MIC\\?:\\s)\\b\\w+\\b")) did unfortunately not work, stating that replacement has "162 555 rows, data has 64306". Could this be because in many rows, the pattern MIC?: I am looking for is appearing multiple times and thus requiring the creation of a new column every time this is detected? Thanks! – Pontus Hedberg Jul 27 '20 at 11:55
  • Yes, that's right. I've edited the solution to accommodate several matches per string. – Chris Ruehlemann Jul 27 '20 at 12:08
  • Did the edited solution help you? – Chris Ruehlemann Jul 27 '20 at 12:43
  • That works better, but what happens is that Each row in the new column contains all the strings/words that was fulfilling the criteria for ALL the rows, i.e. not only for the specific row I am looking at. Any idea how to change this? – Pontus Hedberg Jul 27 '20 at 13:06
  • That is great! Thanks a lot! Just one final question: The word after MIC?: should not be stopped by a . but rather by a space, i.e. I want to be able to detect numeric values like 0.75. Thanks! – Pontus Hedberg Jul 27 '20 at 14:00
  • You can use this: `Pontus_DF$Notes_new <- lapply(str_extract_all(Pontus_DF$Notes, "(?<=MIC\\?:\\s)\\b\\w+(\\.\\w+)?\\b|\\b\\w+\\b(?=\\sMIC\\?:)"), paste0, collapse = ",")` – Chris Ruehlemann Jul 27 '20 at 14:27