I am trying to standardize strings using a defined set of rules. The rules have been formalized as several "gsub" operations, stored as plain text in a data frame column (which I access as an atomic vector using $).

I have 4 separate data frames populated with the strings I want to standardize. I have implemented a for loop which works; however, it requires rewriting the gsub operations for each data frame and is quite slow to run.

While I am aware that apply doesn't provide any real speedup over a for loop unless a compiled function is called, I am in need of an abstracted method to run this standardization over several data-frames (as there will be more in the future).

In order to achieve this generalization, I tried writing a nested apply structure. I am evaluating the gsub operations within the function call from apply using "eval(parse(text = x))". I want to iterate this apply call over the elements of the data frame with strings stored for standardization, hence the higher nested apply.

I am expecting the apply to loop over all operations and apply them sequentially to a string, all the while looping over the string data frame itself. However, this is clearly not working. It throws the following error:

library(data.table)
library(stringi)

repdf <- data.table(Names = c("Palmolive Co. Pvt. Ltd.","Hellenic P. Co.","Freeman's Consortium pvt. ltd."),Address =c("15, Parkway Broadsite, Mumbai","Greco-Roman Architecture Street, Pune","1-B,Black Mesa Compound, Crowbar Street, Delhi."))
gsubop_df <- data.table(Commands = c('"stri_replace_all_regex(x, "Co\\b\\.?","Company")"','"stri_replace_all_regex(x, "\\(P\\.\\)$","Private Limited")"','"stri_replace_all_regex(x, "Corpn\\b\\.?","Corporation")"'))

repdf$Names <- apply(repdf[,1],2,function(x) apply(gsubop_df,2,eval(parse(text = as.character(x)))))
#> Error in parse(text = as.character(x)): <text>:1:11: unexpected symbol
#> 1: Palmolive Co.
#>       

As I mentioned before, I wrote a for loop which works:

name_rule_length <- length(name_clean_rules_apply$Commands)
for(i in 1:nrow(mh_rules_nme)){
MG$Name <- eval(parse(text= mh_rules_nme[i,]))
}

An example of the gsub operation in mh_rules_nme:

stri_replace_all_regex(MG$Name,"M(?:\\|\\/)s","")

This, however, requires me to rewrite the gsub operation for every data frame, whereas I am looking to achieve the same function using a generic "x" from within apply.

However, when I do an atomic eval(parse), it runs fine. Within the looping operation, though, this error is thrown.

Any help in resolving this is much appreciated.

  • It would help if you provided sample data and all relevant code. For instance, you referenced a loop, are you using a literal `for` or `while` loop, or is that meant to reference your `apply`? For help providing a *reproducible* question, please look at https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Feb 11 '19 at 06:32
  • I was in fact mentioning a for loop, thank you for pointing out the lack of clarity. I have edited the question and provided the for loop code, however, there isn't much code to be added otherwise, as both the modification data frame and the operations data frame are imported, member examples of which I provided. I have additionally mentioned the libraries I am using, in case they might be creating a conflict I am not aware of. – singular_instruction Feb 11 '19 at 06:48
  • You should never, ever use `eval(parse())`. The error seems to indicate that you are trying to parse `"arofine polymers"` which is obviously not valid R syntax. – Roland Feb 11 '19 at 06:58
  • Hello Roland, shouldn't the `eval(parse())` evaluate `text= "stri_replace_all_regex(x, "\\(P\\.\\)","Private")"` as `expression(stri_replace_all_regex("arofine polymers","\\(P\\.\\)","Private"))`? Also, can you tell me why `eval(parse()) ` should never be used? Did you mean it in the context of my code or in general? – singular_instruction Feb 11 '19 at 07:08
  • I mean in general. Code using it is slow and difficult to debug and maintain. It's also not necessary to create R commands as strings and parse them. There is a better way and if you provided a [**minimal** and **reproducible** example](https://stackoverflow.com/a/5963610/1412059) we could show it to you. I don't think you are passing to `parse` what you believe you are passing to it. – Roland Feb 11 '19 at 07:44
  • Thanks Roland, I have updated the question with a minimal reprex and removed the unnecessary bits. And yes, I suspect I am not fully comprehending the parse argument. You mentioned a better way, would you kindly elaborate? – singular_instruction Feb 11 '19 at 08:07

1 Answer

I don't fully understand why you are using apply to loop over a one column data.table. You should use it with margin = 2 only for matrices. For data.frames/data.tables you should use lapply instead. Anyway, all of this is unnecessary since stri_replace_all_regex is vectorized:

gsubop_df <- data.table(regex = c( "Co\\b\\.?", "P\\.", "Corpn\\b\\.?"), #changed slightly for illustration
                        replacement = c("Company", "Private Limited", "Corporation"))


stri_replace_all_regex(repdf[,Names], replacement = gsubop_df$replacement,
                       pattern = gsubop_df$regex, vectorize_all = FALSE)
#[1] "Palmolive Company Pvt. Ltd."      "Hellenic Private Limited Company" "Freeman's Consortium pvt. ltd."  
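
To cover the "several data frames" part of the question, the same call can be wrapped in a small function and mapped over a list of tables. The following is only a minimal sketch: the helper name `standardize`, the list `tables`, its element names, and the new column `Names_STD` are hypothetical, and it assumes every table keeps its strings in a column called `Names`.

```r
library(data.table)
library(stringi)

# Rules stored as data (pattern/replacement pairs) instead of code strings.
rules <- data.table(
  regex       = c("Co\\b\\.?", "P\\.", "Corpn\\b\\.?"),
  replacement = c("Company", "Private Limited", "Corporation")
)

# Hypothetical helper: apply every rule, in order, to a character vector.
standardize <- function(x, rules) {
  stri_replace_all_regex(x,
                         pattern = rules$regex,
                         replacement = rules$replacement,
                         vectorize_all = FALSE)
}

# Hypothetical input tables; each is assumed to have a "Names" column.
tables <- list(
  first  = data.table(Names = c("Palmolive Co. Pvt. Ltd.", "Hellenic P. Co.")),
  second = data.table(Names = "Freeman's Consortium pvt. ltd.")
)

# lapply is the right tool for a list; copy() leaves the inputs untouched.
cleaned <- lapply(tables, function(dt) {
  dt <- copy(dt)
  dt[, Names_STD := standardize(Names, rules)]
  dt
})
```

New tables then only need to be added to the list, and new rules only need a new row in `rules`; no command strings have to be rewritten or parsed.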
Roland
  • Hey Roland, thanks for the guidance. Your code runs smoothly, I tried changing the details (such as company names and regexes) just to ensure applicability, and it worked. However, when I apply the same code to the larger dataset, using the same syntax, the same rules and the same data-structure, it returns the Name Column as it is, unchanged. I have been trying to figure out what's going wrong, here is the syntax, taken verbatim. `gm_darg_subco$CL_Name_STD <- stri_replace_all_regex(str = gm_darg_subco[,Name_Modifier], replacement = fnc$Rep_1,pattern = fnc$Pat_1, vectorize_all = FALSE)` – singular_instruction Feb 11 '19 at 12:46
  • Also, it gives this additional warning: `Warning message: In stri_replace_all_regex(str = gm_darg_subco[, Name_Modifier], : argument is not an atomic vector; coercing` – singular_instruction Feb 11 '19 at 12:52
  • Try `gm_darg_subco[[Name_Modifier]]`. You should probably read the data.table vignettes. – Roland Feb 11 '19 at 13:00
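
The warning in the comment above says that something other than an atomic vector reached `stri_replace_all_regex`. The original data is not shown, so the cause cannot be confirmed; one common way this happens, assumed here purely for illustration, is a column selection that returns a one-column data.table instead of the column itself. The sketch below uses a hypothetical table `dt` and a hypothetical variable `col` holding the column name, and assumes a reasonably recent data.table version.

```r
library(data.table)

dt  <- data.table(Names = c("Hellenic P. Co.", "Palmolive Co. Pvt. Ltd."))
col <- "Names"  # hypothetical variable holding the column name

# An unquoted column name in j returns the column as a plain vector.
is.character(dt[, Names])

# A quoted name, or a ..-prefixed variable, returns a one-column data.table.
is.data.table(dt[, "Names"])
is.data.table(dt[, ..col])

# [[ ]] extracts the column itself, whether given a string or a variable.
is.character(dt[["Names"]])
is.character(dt[[col]])
```

All five checks are expected to return TRUE. Passing the `[[ ]]` form to a stringi function hands it a character vector, so no coercion is needed.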