0

I have two dataframes in R - TABLEA (~6,000 rows) and TABLEB (~30,000 rows). All columns are character format.

TABLEA: COL1 - COL2 - COL3 - COL4

TABLEB: COL1 - COL2 - COL3 - COL4 - COL5 - COL6 - COL7

I would like to add a 5th column to TABLEA by taking the values of COL1-COL4 for each row in TABLEA and performing a regex match on TABLEB for the same columns. The value of COL5 from TABLEB should then be added to the appropriate row of TABLEA at COL5.

I have written a function to do this and tried a fop-loop and mapply technique but it is taking a very long time to run. I suspect that my function to perform the lookup is terribly inefficient and that there must be a better way to do this.

Any helpful suggestions or tips on how I can achieve this?

getValue<-function(input1, input2, input3, input4) 
{
  #filter TABLEB on 4 columns matching the 4 inputs above.
  FILTERED <- TABLEB %>% filter(str_detect(input1, COL1)) %>% filter(str_detect(input2, COL2)) %>%  
    filter(str_detect(input3, COL3)) %>% filter(str_detect(input4, COL4)) %>% 
    #select the 1st value from column 5, in case of multiple matches
    select(COL5) %>%  head(1)

  VALUE<-as.character(FILTERED$COL5)

  #if no match found
  if(length(VALUE) == 0) VALUE = "UNKNOWN"

  return(VALUE)
}

#for-loop method
for(i in 1:nrow(TABLEA)) {
  TABLEA[i, ]$COL5<-getValue(TABLEA[i, ]$COL1, TABLEA[i, ]$COL2, TABLEA[i, ]$COL3, TABLEA[i, ]$COL4)  

#mapply method
TABLEA$COL5<-mapply(getValue, TABLEA$COL1, TABLEA$COL2, TABLEA$COL3, TABLEA$COL4)
}

djx99me
  • 105
  • 1
  • 6
  • 1
    Please consider a small reproducible example – akrun Mar 13 '20 at 17:02
  • 1
    You should provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) which is [minimal, but complete and verifiable example](https://stackoverflow.com/help/minimal-reproducible-example) including the errors (if any). Your question should be clear and specific. Take the [tour](https://stackoverflow.com/tour) and read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) – M-- Mar 13 '20 at 17:02
  • ```library(data.table); setDT(TABLEA)[setDT(TABLEB), COL5:= i.COL5, on = c(COL1="COL1", COL2="COL2", COL3="COL3", COL4="COL4" )]``` – M-- Mar 13 '20 at 17:12
  • Thank M--. Is there a way to update your code to work with regex matching? The Columns in TABLEA are fixed strings and the columns in TABLEB are regex patterns. For example TABLEA$COL1 = "Value", TABLEB$COL1 = "Va[lmn]ue" – djx99me Mar 13 '20 at 17:23
  • 1
    You can. use `names` – akrun Mar 13 '20 at 17:25
  • 1
    if you follow the links I shared, and provide an example, I will post an answer that would address your specific problem. But using `names` as @akrun suggested is a way to go. – M-- Mar 13 '20 at 19:54

0 Answers0