
I would like to make a function that takes a list of strings and matches them against a data frame with many (>100) columns. For every string that is found, the function should return the `col1` value of the row in which it was found, collected into a new data frame. In other words, the function starts looking for string1 in col1 and stops once it is found there. string4 is not found in col1 but in col2 (row 2), so it returns string2, the col1 value of that row. Likewise, string3 is found in col1 directly, while string6 is only found in col3 of the same row and therefore also returns string3.

strings<-c("string1", "string2", "string3", "string4", "string5", "string6")

df

col1    col2    col3
string1 x   x
string2 string4 x
string3 string5 string6

dput(df)

structure(list(col1 = structure(1:3, .Label = c("string1 x   x", 
"string2 string4 x", "string3 string5 string6"), class = "factor"), 
    col2 = c(NA, NA, NA), col3 = c(NA, NA, NA)), .Names = c("col1", 
"col2", "col3"), class = "data.frame", row.names = c(NA, -3L))

df_new

col_combo
string1
string2
string3
string2
string3
string3
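
For clarity, the desired output expressed as an R object (my reading of the table above; the column name `col_combo` is taken from it):

df_new <- data.frame(col_combo = c("string1", "string2", "string3",
                                   "string2", "string3", "string3"),
                     stringsAsFactors = FALSE)
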
user2904120

  • [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – Sotos Nov 03 '17 at 12:36
  • Difficult to understand. – Henry Navarro Nov 03 '17 at 12:40
  • I hope it's better now – user2904120 Nov 03 '17 at 13:15
  • No idea what you want to do. – Sotos Nov 03 '17 at 13:44
  • I still do not fully understand what your desired output is, but I guess you should have a look at `lapply()` (to be used on each string), `apply(df, 2, ...)` (to be applied to each column), `grepl()` (to check whether a string is present) and `any()` (to check whether any of the matches is `TRUE`). You would then have to specify the return value, which I cannot tell from your description. – Manuel Bickel Nov 03 '17 at 13:46
  • I start with a list of strings and one df, and want to return the rows containing strings from the list. The only tricky part is that strings that match other columns should still return col1. Like with synonyms in a dictionary. – user2904120 Nov 03 '17 at 13:52

1 Answer


The hint in your last comment that you are looking for a function to find synonyms was helpful for providing an answer. The following answer is certainly neither the most elegant nor the fastest. If you have a large number of synonyms to find, you might look at solutions with data.table, which is the fastest package for such lookup tasks. However, to get you going, I will only use base R. (Please note that I have rewritten your data, since the structure of your df did not make sense to me; I hope my understanding of your data is correct.)

UPDATE: I have introduced additional options based on searching collapsed rows, one using the `stringi` package, and a `data.table` approach. I also added a "string7" that will not be matched, and benchmarks using `microbenchmark`. My `data.table` approach does not seem to be the best one; furthermore, the package comes with a certain overhead for small data sets, and its speed advantage only becomes evident on larger data. You might provide a larger reproducible example, maybe based on the functions used in this question link, for a better comparison. Note, however, that `data.table` may only be faster for exact matching (check `?"%like%"`). The other functions I tested would have to be adapted for a fair comparison, maybe also using `fmatch()` from the `fastmatch` package; a sketch of such a hashed exact lookup follows the benchmarks below.

strings <- c("string1", "string2", "string3", "string4", "string5", "string6", "string7")

df <- read.table(text = "col1    col2    col3
                 string1 x   x
                 string2 string4 x
                 string3 string5 string6",
                 stringsAsFactors = FALSE, header = TRUE)


find_synonyms <- function(df, strings) {
  # for each string, return the col1 value of every row that contains it
  sapply(strings, function(x) {
    synonyms <- apply(df, 1, function(y) {
      # you could also use match(); grep()/grepl() allow partial matching
      # if needed, here the ^...$ anchors enforce an exact match
      if (any(grepl(paste0("^", x, "$"), y))) {
        y[1]
      } else {
        NA
      }
    })
    synonyms[!is.na(synonyms)]
  })
}
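
As the comment inside the function hints, the anchored `grepl()` call can be replaced by an exact lookup via `match()`. A minimal sketch of such a variant (the name `find_synonyms_match` is just chosen here; it takes the same inputs and is not part of the benchmarks below):

find_synonyms_match <- function(df, strings) {
  sapply(strings, function(x) {
    synonyms <- apply(df, 1, function(y) {
      # match() returns the position of the first exact hit in the row, or NA
      if (!is.na(match(x, y))) y[1] else NA
    })
    synonyms[!is.na(synonyms)]
  })
}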


find_synonyms_collapse_rows_grepl <- function(df, strings) {

  synsets <- apply(df, 1, paste, collapse = " ")
  names(synsets) <- df[,1]

  sapply(strings, function(x) {

    names(synsets)[grep(paste0("\\b", x ,"\\b"), synsets, perl=T)]  

  })
}


library(stringi)
find_synonyms_collapse_rows_stringi <- function(df, strings) {

  synsets <- apply(df, 1, paste, collapse = " ")
  names(synsets) <- df[,1]

  sapply(strings, function(x) {

    names(synsets)[stri_detect_regex(synsets, paste0("\\b", x ,"\\b"))]  

  })
}


library(data.table)
find_synonyms_DT_reshape_like <- function(df, strings) {

  df <- as.data.table(df)
  df[ , mainsynonym := col1]
  df <- melt(df, id.vars = "mainsynonym")
  setkey(df, value)

  sapply(strings, function(x) {

    df[value %like% x, mainsynonym]

  })
}

find_synonyms_DT_matchkey <- function(df, strings) {

  df <- as.data.table(df)
  df[ , mainsynonym := col1]
  df <- melt(df, id.vars = "mainsynonym")
  setkey(df, value)

  sapply(strings, function(x) {

    df[value == x , mainsynonym]

  })
}


results_list <- list(unlist(find_synonyms(df, strings)),
                     unlist(find_synonyms_collapse_rows_grepl(df, strings)),
                     unlist(find_synonyms_collapse_rows_stringi(df, strings)),
                     unlist(find_synonyms_DT_reshape_like(df, strings)),
                     unlist(find_synonyms_DT_matchkey(df, strings))
                     )

sapply(results_list, function(x) {

  sapply(results_list, function(y) {

    identical(x,y)
  })

}
)
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE TRUE TRUE TRUE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE
# [4,] TRUE TRUE TRUE TRUE TRUE
# [5,] TRUE TRUE TRUE TRUE TRUE


library(microbenchmark)

microbenchmark(
find_synonyms(df, strings),
find_synonyms_collapse_rows_grepl(df, strings),
find_synonyms_collapse_rows_stringi(df, strings),
find_synonyms_DT_reshape_like(df, strings),
find_synonyms_DT_matchkey(df, strings)
)

# Unit: microseconds
# expr      min        lq      mean    median        uq       max neval
# find_synonyms(df, strings)  719.624  848.5085 1129.4298  987.6565 1058.9080  9290.361   100
# find_synonyms_collapse_rows_grepl(df, strings)  660.017  738.1770  952.7571  794.8230  839.4295 16998.577   100
# find_synonyms_collapse_rows_stringi(df, strings)  223.428  265.8625  364.4979  302.9685  344.4170  5798.433   100
# find_synonyms_DT_reshape_like(df, strings) 3259.029 3643.9060 3900.1955 3800.8180 4102.7985  5883.303   100
# find_synonyms_DT_matchkey(df, strings) 4710.135 4907.9040 5428.8650 5279.5595 5630.8855  8450.769   100
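
As noted in the update, a fair exact-match comparison could use a hashed lookup. A minimal sketch with `fastmatch::fmatch()` (the function name `find_synonyms_fmatch` is mine and this variant is not benchmarked above; it returns only the first exact hit per string and `NA` when there is none, so its output differs slightly from the functions above):

library(fastmatch)

find_synonyms_fmatch <- function(df, strings) {
  # long format: every cell of df paired with the col1 value of its row
  long_values <- unlist(df, use.names = FALSE)
  long_main   <- rep(df[[1]], times = ncol(df))

  # fmatch() hashes long_values once and returns, for each string,
  # the position of its first exact occurrence (NA if it is absent)
  idx <- fmatch(strings, long_values)
  setNames(long_main[idx], strings)
}

# find_synonyms_fmatch(df, strings) should map string4 to "string2"
# and return NA for the unmatched string7
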
Manuel Bickel

  • Would it make sense to convert df to a matrix to make the matching function a bit faster? I tried it for a large dataset and it takes a while. – user2904120 Nov 05 '17 at 23:58
  • Swapping 'string2' in df with 'string7', which does not exist in the list, changes the output. It would be nice if the function could skip non-matching strings. – user2904120 Nov 06 '17 at 00:15
  • Changing to a matrix would not significantly speed up the function from my perspective, but using additional packages like `stringi`/`stringr` and `data.table` would. Furthermore, the internal data structure that is created in the function and assigned to a variable makes it slow. I'll have a look at this later. – Manuel Bickel Nov 06 '17 at 07:59
  • Concerning strings that are not matched, you receive a `logical(0)` as output. From my perspective this is a reasonable output, but you could change it to NA or something appropriate. The line `synonyms[!(is.na(synonyms))]` would have to be adapted to return such a value when the subsetting result is empty; see the sketch after these comments. – Manuel Bickel Nov 06 '17 at 08:02
  • You might also draw more attention to your issue by renaming the title of your question, for example to "How to do fast synonym lookup and return the synonym of highest order from a lookup table?". I am not an expert with `data.table`; I'll have a look anyway, but others might well propose faster solutions. – Manuel Bickel Nov 06 '17 at 08:06
  • Check the update: `stringi` and collapsing rows make it faster. Sorry, the `data.table` solution is still pending. – Manuel Bickel Nov 06 '17 at 09:02
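
As mentioned in the comment above on unmatched strings, a minimal sketch of how the base R answer could return `NA` instead of an empty vector (the name `find_synonyms_na` is mine; an illustration only, not benchmarked):

find_synonyms_na <- function(df, strings) {
  sapply(strings, function(x) {
    synonyms <- apply(df, 1, function(y) {
      if (any(grepl(paste0("^", x, "$"), y))) y[1] else NA
    })
    synonyms <- synonyms[!is.na(synonyms)]
    # return NA instead of an empty vector when nothing matched
    if (length(synonyms) == 0) NA_character_ else synonyms
  })
}

# find_synonyms_na(df, strings)[["string7"]] is now NA instead of an empty vector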