2

I have the dataframe below, where each row represents changes in text. I then use the adist() function to extract whether the change is a match (M), insertion (I), substitution (S) or deletion (D).

I need to find all of the indices of Is in the change column (illustrated here in the insrtion_idx column). Using those indices, I need to extract the corresponding characters in current_text (illustrated here in insertion_chars).

df <- tibble(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"),
             insertion_idx = c(c(1),c(2),c(3,4),""),
             insertion_chars = c("A","B","CD",""))

I have tried splitting up strings and comparing string differences, but this gets very messy very fast with real-world data. How do I accomplish the above task?

Adam_G
  • 7,337
  • 20
  • 86
  • 148
  • 3
    Think you're in regex-land - there should be a duplicate for this somewhere as I think you want: `gr <- gregexpr("I", df$change); gr; regmatches(df$current_text, gr)` – thelatemail May 24 '23 at 23:06
  • Probably essentially a duplicate of https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match/23901600 – thelatemail May 24 '23 at 23:49
  • Just to expand on @thelatemail answer/comment: ```df$insertion_chars <- sapply(regmatches(df$current_text, gregexpr("I", df$change)), paste, collapse = "")``` this will give you the desired column. Or a little bit cleaner version: ```within(df, insertion_chars <- sapply(regmatches(current_text, gregexpr("I", change)), paste, collapse = ""))``` – M-- May 25 '23 at 00:22
  • 1
    @r2evans duplicate comment, LOL ;) – M-- May 25 '23 at 03:39

2 Answers2

3

Turning my comment about using gregexpr and regmatches into an answer.
A lot of this procedure is very similar to the content in this question - Extract a regular expression match - if you are looking for alternative methods.

df <- data.frame(current_text = c("A","AB","ABCD","ABZ"),
             previous_text = c("","A","AB","ABCD"),
             change = c("I","MI","MMII","MMSD"))

df$insertion_idx <- gregexpr("I", df$change)
df$insertion_chars <- sapply(regmatches(df$current_text, df$insertion_idx), 
                             paste, collapse="")
df
##  current_text previous_text change insertion_chars insertion_idx
##1            A                    I               A             1
##2           AB             A     MI               B             2
##3         ABCD            AB   MMII              CD          3, 4
##4          ABZ          ABCD   MMSD                            -1
thelatemail
  • 91,185
  • 12
  • 128
  • 188
2

Try this alternative to thelatemail's (excellent) recommendation (which also works):

quux <- structure(list(current_text = c("A", "AB", "ABCD", "ABZ"), previous_text = c("", "A", "AB", "ABCD"), change = c("I", "MI", "MMII", "MMSD")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

quux$insertion_idx <- lapply(strsplit(quux$change, ""), function(z) which(z == "I"))
quux$insertion_chars <- mapply(function(ctxt, idx) {
  if (length(idx)) paste(substring(ctxt, idx, idx), collapse = "") else ""
}, quux$current_text, quux$insertion_idx)
quux
# # A tibble: 4 × 5
#   current_text previous_text change insertion_idx insertion_chars
#   <chr>        <chr>         <chr>  <list>        <chr>          
# 1 A            ""            I      <int [1]>     "A"            
# 2 AB           "A"           MI     <int [1]>     "B"            
# 3 ABCD         "AB"          MMII   <int [2]>     "CD"           
# 4 ABZ          "ABCD"        MMSD   <int [0]>     ""             

Note that insertion_idx is a list-column with the indices you were looking for:

str(quux)
# tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
#  $ current_text   : chr [1:4] "A" "AB" "ABCD" "ABZ"
#  $ previous_text  : chr [1:4] "" "A" "AB" "ABCD"
#  $ change         : chr [1:4] "I" "MI" "MMII" "MMSD"
#  $ insertion_idx  :List of 4
#   ..$ : int 1
#   ..$ : int 2
#   ..$ : int [1:2] 3 4
#   ..$ : int(0) 
#  $ insertion_chars: Named chr [1:4] "A" "B" "CD" ""
#   ..- attr(*, "names")= chr [1:4] "A" "AB" "ABCD" "ABZ"
r2evans
  • 141,215
  • 6
  • 77
  • 149