I have a data.frame with two rows and 20 columns where each column holds one character, which roughly looks like this (columns scrunched here for clarity):

        Cols 1-20
  row1  ghuytuthjilujshdftgu 
  row2  ghuytuthjilujshdftgu

I want a mechanism for comparing these two strings character by character (column by column), starting from position 10 and scanning outwards, and returning the number of matching characters until the first difference is encountered. In this case both lines are identical, so the answer would be 20. Importantly, even if the rows are completely identical, as in the case above, there should be no error message; the full length (20) should be returned.

With this alternate example, the answer should be 12:

    Cols 1-20
row1  ghuytuthjilujshdftgu 
row2  XXXXXXXXjilujshdftgu

Here is some code to generate the data frames:

r1 <- "ghuytuthjilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

r1 <- "ghuytuthjilujshdftgu"
r2 <- "XXXXXXXXjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

Edit.

The class of the object is data.frame and it is subsettable, with dim = 2, 20 (each column/character is accessible on its own).

tauculator
user3069326

1 Answer

This answer splits the data frame into two pieces, left and right of the center, reversing the left piece so that it runs from the center back to the first column. It then measures each piece with cumsum over a vector that is 1 for a match and NA for a mismatch: the running sum becomes NA at the first mismatch and stays NA, so the highest index that is not NA gives the length of the matching stretch starting at the center.

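sim_len <- function(df, center = floor(ncol(df) / 2)) {
  # Left piece runs from center back to column 1; right piece from center to the end
  dfs <- list(df[, max(center, 1):1, drop = FALSE], df[, center:ncol(df), drop = FALSE])
  df.count <- lapply(dfs, function(piece) {
    # Matches add 1; the first mismatch turns the rest of the cumsum into NA
    run <- cumsum(ifelse(piece[1, ] == piece[2, ], 1, NA_integer_))
    ok <- which(!is.na(run))
    # Highest non-NA index is the run length; 0 if the center itself mismatches
    if (length(ok)) run[max(ok)] else 0
  })
  # The center column is counted in both pieces, hence the -1L
  max(0L, sum(unlist(df.count)) - 1L)
}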

And here are some examples of running it (the as.data.frame business just creates the data frame from the character strings). Note that the "center" column is counted twice, hence the -1L in the final line of the function.

r1 <- "ghuytuthjilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df1)
# [1] 20

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df2)
# [1] 12

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujxhdftgu"
df3 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df3)
# [1] 5

r1 <- "ghuytut3xilujshdftgu"
r2 <- "ghuytuthjixujxhdftgu"
df4 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df4)
# [1] 1
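
The cumsum/NA trick the function relies on can be seen in isolation. The following is only a minimal sketch on a made-up logical vector (`same`, `steps`, `run` and `run_len` are illustrative names, not part of the answer's code):

```r
# Made-up comparison result: TRUE where the two rows agree, scanning outwards
same <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Matches become 1, mismatches become NA
steps <- ifelse(same, 1, NA_integer_)

# cumsum() propagates NA, so every element from the first mismatch onwards is NA
run <- cumsum(steps)
print(run)      # 1, 2, 3, then NA for the remaining positions

# The last non-NA position holds the length of the matching run
run_len <- run[max(which(!is.na(run)))]
print(run_len)  # 3
```

The matches after the mismatch are ignored because the running sum can never recover from NA, which is exactly the "stop at the first difference" behaviour asked for.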

A variation that reports both left and right counts. Note that the "center" is counted in both left and right, so left + right is 1 greater than what the original function reports:

sim_len2 <- function(df, center = floor(ncol(df) / 2)) {
  dfs <- list(
    left  = df[, max(center, 1):1, drop = FALSE],
    right = df[, center:ncol(df), drop = FALSE]
  )
  vapply(
    dfs,
    function(piece) {
      run <- cumsum(ifelse(piece[1, ] == piece[2, ], 1, NA_integer_))
      ok <- which(!is.na(run))
      # Run length from the center outwards; 0 if the center itself mismatches
      if (length(ok)) run[max(ok)] else 0
    },
    numeric(1L)
  )
}
sim_len2(df1)
# left right 
#   10    11
sim_len2(df4, 4)
# left right 
#    4     4 
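
If the data start out as plain strings, the same counts can be obtained without building a data frame. The sketch below is an alternative, not the method above: `match_run` is a hypothetical helper that swaps the cumsum/NA step for `cumprod()` and adds the bounds check on `center` that the functions above do not perform.

```r
# Hypothetical helper, not part of the original answer.
# Returns the matching run to the left and right of `center` (center counted
# in both), plus the total with the center counted once.
match_run <- function(s1, s2, center = floor(nchar(s1) / 2)) {
  stopifnot(nchar(s1) == nchar(s2), center >= 1, center <= nchar(s1))
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  same <- a == b
  # cumprod() of a logical vector stays 1 until the first FALSE, then is 0
  left  <- sum(cumprod(same[center:1]))
  right <- sum(cumprod(same[center:length(same)]))
  c(left = left, right = right, total = max(0, left + right - 1))
}

# Second example from the question: left 2, right 11, total 12
match_run("ghuytuthjilujshdftgu", "XXXXXXXXjilujshdftgu", center = 10)

# Mismatch at the center itself: all three counts are 0
match_run("ghuytuthjilujshdftgu", "ghuytuthjXlujshdftgu", center = 10)
```

Summing the cumulative product counts the leading run of matches directly, so a mismatch at the center simply yields 0 rather than needing a special case.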
BrodieG
  • @BroadieG that works ...but could you somehow implement that it doesn't start automatically in the centre of the string but rather at a given position? – user3069326 Jan 10 '14 at 08:02
  • @BroadieG can I replace centre with any arbitrary position? And is there a way to display not only the final result but also how many match to the left and to the right? – user3069326 Jan 10 '14 at 08:15
  • @user3069326, I modified the code to add the optional `center` argument. Note this doesn't do any checking on whether your `center` is reasonable (i.e. within the # of cols of the `df`). If this works for you, please mark the q as answered, though I don't know if you can if it is on hold. – BrodieG Jan 10 '14 at 13:26
  • @BroadieG I went through your code and, if I am not mistaken, it only checks how many differences there are between both rows? Is it possible to get something like "4 are identical to the right of the centre and 3 are identical to the left of the centre", if you see what I mean. Additionally, if I type sim_len(df1, 4), where 4 is the center, I get an error: df[1, ]: incorrect number of dimensions – user3069326 Jan 10 '14 at 15:01
  • @user3069326, I cannot reproduce your error. When I run it I get the correct answer. Please clear your workspace, re-copy and paste the code to reload the functions and the data, re-run and let me know if you still get the error. – BrodieG Jan 10 '14 at 15:15
  • @user3069326, also, now added variation that reports left and right counts. – BrodieG Jan 10 '14 at 15:15
  • Does df1 NEED to be a data frame or could it also be a matrix? I suppose that might be the problem behind my "df[1, ]: incorrect number of dimensions" error? – user3069326 Jan 10 '14 at 15:28
  • It did need to be a data frame, as per your question. Modified now so it works with both. Also added a fix for the case where you select center==1 – BrodieG Jan 10 '14 at 15:50
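
One plausible source of the "incorrect number of dimensions" message discussed in the comments (an assumption, since the asker's exact object is not shown) is a matrix collapsing to a vector when a single column is selected without `drop = FALSE`. A minimal sketch with a made-up 2 x 3 character matrix `m`:

```r
# Made-up matrix with one character per cell, two rows
m <- rbind(strsplit("abc", "")[[1]], strsplit("abd", "")[[1]])

# Selecting a single column without drop = FALSE collapses the matrix to a vector
v <- m[, 1]
is.null(dim(v))   # TRUE: the two-row structure is gone

# Row indexing the collapsed vector fails; capture the message instead of stopping
res <- tryCatch(v[1, ], error = function(e) conditionMessage(e))
print(res)

# With drop = FALSE the 2 x 1 shape survives, so row indexing still works
k <- m[, 1, drop = FALSE]
dim(k)            # 2 1
k[1, ] == k[2, ]  # TRUE
```

This is why the functions above pass `drop = FALSE` when slicing: with `center = 1` the left piece is a single column, and it has to stay two-dimensional for the row comparison to work.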