
I have a problem matching two rows in a table:

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

I want to specify a column number (say column 5) and find how many columns to the right and to the left of that column the two sequences are identical. In other words, I want a mechanism for comparing these two strings character by character (column by column), starting from the chosen column and scanning outwards, that returns the number of matching characters until the first difference is encountered.

user3069326
  • so... `ident<-TRUE; while(ident) { ident<-df2[1,5-j]==df2[2,5+j];j – Carl Witthoft Jan 10 '14 at 15:22
  • Could you potentially explain the last bit of the answer in a bit more detail? – user3069326 Jan 10 '14 at 15:54
  • Well, like `r1right<-df2[1,1:5] ; r1left <- df2[1,6:(dim(df2)[2])]`. Then `r1left<-rev(r1left)`, and then find or write some function which counts how many characters are the same (`r1left[j]==r1right[j]`). Probably not worth the effort :-) – Carl Witthoft Jan 10 '14 at 16:53
  • Why are you reposting basically [the same question](http://stackoverflow.com/questions/20970770/matching-of-patterns-in-r), which, furthermore, you commented had been answered? – BrodieG Jan 10 '14 at 18:45

1 Answer


The most helpful function I've seen for "number in a row" questions is `rle`, which computes the run-length encoding of a vector. For instance, you can see the lengths of the runs in which the characters of your two strings match or differ with:

r1 = "ghuytut3jilujshdftgu"
r2 = "ghuytuthjilujshdftgu"
spl1 = unlist(strsplit(r1, ""))
spl2 = unlist(strsplit(r2, ""))
rle(spl1 == spl2)
# Run Length Encoding
#   lengths: int [1:3] 7 1 12
#   values : logi [1:3] TRUE FALSE TRUE

For your problem, you want to compute the run length of matches starting from some interior index i, both forward and backward. Here's an implementation of that using `rle`. The function assumes the strings have the same length and that i is a valid index; the forward and backward run lengths both include the character at index i:

fxn = function(r1, r2, i) {
  spl1 = unlist(strsplit(r1, ""))
  spl2 = unlist(strsplit(r2, ""))
  if (spl1[i] != spl2[i]) {
    return(list(forward=0, backward=0))
  }
  rle.backward = rle(spl1[i:1] == spl2[i:1])
  rle.forward = rle(spl1[i:nchar(r1)] == spl2[i:nchar(r2)])
  return(list(forward=rle.forward$lengths[1], backward=rle.backward$lengths[1]))
}
fxn(r1, r2, 5)
# $forward
# [1] 3
# 
# $backward
# [1] 5

fxn(r1, r2, 9)
# $forward
# [1] 12
# 
# $backward
# [1] 1
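
Since the question stores the sequences as rows of `df2` rather than as strings, the same run-length idea can probably be applied to the rows directly. The following is only a minimal sketch under that assumption; `match_runs_df` and its `row1`/`row2` arguments are hypothetical names introduced for illustration, not part of the answer above.

```r
r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

# Hypothetical helper: same logic as fxn, but reading two rows of a data frame.
match_runs_df <- function(df, i, row1 = 1, row2 = 2) {
  # as.character guards against factor columns (the default in R < 4.0)
  a <- vapply(df[row1, ], as.character, character(1))
  b <- vapply(df[row2, ], as.character, character(1))
  same <- unname(a == b)  # TRUE where the two rows agree in a column
  if (!same[i]) {
    return(list(forward = 0, backward = 0))
  }
  # First run of TRUEs scanning right from column i, then left from column i
  list(forward = rle(same[i:length(same)])$lengths[1],
       backward = rle(same[i:1])$lengths[1])
}

match_runs_df(df2, 5)
```

As with `fxn`, both counts include column `i` itself, and both are 0 when the rows already differ at column `i`.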
josliber
  • +1 for using my favorite function. Not sure how well this scales (time penalty) for very large strings. – Carl Witthoft Jan 10 '14 at 19:02
  • @CarlWitthoft good point -- it would probably be slower than a loop in the case of long strings with many differences. – josliber Jan 10 '14 at 19:11
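
To illustrate the loop alternative mentioned in these comments, here is a rough sketch that scans outward from index `i` and stops at the first mismatch on each side, rather than comparing every position. The name `count_matches_loop` is hypothetical, and the same assumptions as `fxn` apply (equal-length strings, valid `i`). Note that it still splits both strings in full, so only the comparison step stops early.

```r
r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"

# Hypothetical loop-based variant of fxn: compares outward from i only
# until the first difference is found in each direction.
count_matches_loop <- function(r1, r2, i) {
  spl1 <- strsplit(r1, "")[[1]]
  spl2 <- strsplit(r2, "")[[1]]
  n <- length(spl1)
  if (spl1[i] != spl2[i]) {
    return(list(forward = 0, backward = 0))
  }
  # Both counts start at 1 because they include the character at index i
  forward <- 1
  while (i + forward <= n && spl1[i + forward] == spl2[i + forward]) {
    forward <- forward + 1
  }
  backward <- 1
  while (i - backward >= 1 && spl1[i - backward] == spl2[i - backward]) {
    backward <- backward + 1
  }
  list(forward = forward, backward = backward)
}

count_matches_loop(r1, r2, 5)
```

For these example strings the counts agree with the `rle` version. Whether the loop is actually faster would depend on string length and on how far the matching runs extend, so it is worth timing on real data before choosing one.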