Application of Text Mining in R on large dataset

Question

The below R script computes the percentage similarity between two strings of text in columns "names1" and "names2". However, my requirement is to perform the same operation on 6k-10K+ column items. When the below Formula gets applied on such a big column, the solution goes for a toss as the count of line items goes in millions, and is not considered vital for enterprise delivery. Also along with the "percent" column, I need to put additional 6-7 other columns which will make the solution size above 1 GB. Kindly help me to update the script else a possible solution to achieve the same. Thanks a lot.

library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
names1 <- c("Adam Shaw","Justin Bose","Cydney Clide")
names2 <- c("Adam Shaw","Justin Bose","Cydney Clide")
names1 <- as.character(names1)
names2 <- as.character(names2)
Percent <- paste(round(unlist(lapply(1:length(names1), function(x) { 
levenshteinSim(names1[x], names2[-x])}))*100, 1), "%", sep="")

HPC reading material: https://cran.r-project.org/web/views/HighPerformanceComputing.html / A Spark example (which can be adapted to R w/its spark connectors) http://aseigneurin.github.io/2016/02/22/record-inkage-a-real-use-case-with-spark-ml.html (HPC does not really lend itself to "copypasta") — hrbrmstr, Jan 13 '18 at 12:22

Antonios · Answer 1 · 2018-01-13T16:20:11.633

You may find Vectorization helpful:

#Create a large character Vector:
names1<-as.character(rep(iris$Species,10))

# Use Lapply
system.time(Percent <- paste(round(unlist(lapply(1:length(names1), function(x) { 
  levenshteinSim(names1[x], names1[-x])}))*100, 1), "%", sep=""))

#Create Vectorized Function
fun1<-function(names,x) {
  return(levenshteinSim(names[x],names[-x]))
}

vecFun1<-Vectorize(fun1,vectorize.args = "x")


#Execute Vectorized Function
system.time(percentVec<-vecFun1(names1,c(1:length(names1))))
percentVec<-paste(as.character(round(c(percentVec)*100,1)),"%",sep="")

Here is the code execution, vectorization takes less than 1/3 of the time

> names1<-as.character(rep(iris$Species,10))
> system.time(Percent <- paste(round(unlist(lapply(1:length(names1), function(x) { 
+   levenshteinSim(names1[x], names1[-x])}))*100, 1), "%", sep=""))
   user  system elapsed 
   5.07    0.02    5.09 
> 
> fun1<-function(names,x) {
+   return(levenshteinSim(names[x],names[-x]))
+ }
> 
> vecFun1<-Vectorize(fun1,vectorize.args = "x")
> 
> system.time(percentVec<-vecFun1(names1,c(1:length(names1))))
   user  system elapsed 
   1.62    0.00    1.62

I also use your example with character vector of 3 elements

> names2<-c("Adam Shaw","Justin Bose","Cydney Clide")
> names2 <- as.character(names2)
> system.time(Percent <- paste(round(unlist(lapply(1:length(names2), function(x) { 
+   levenshteinSim(names2[x], names2[-x])}))*100, 1), "%", sep=""))
   user  system elapsed 
      0       0       0 
> 
> fun1<-function(names,x) {
+   return(levenshteinSim(names[x],names[-x]))
+ }
> 
> vecFun1<-Vectorize(fun1,vectorize.args = "x")
> 
> system.time(percentVec<-vecFun1(names2,c(1:length(names2))))
   user  system elapsed 
      0       0       0 
> 
> percentVec<-paste(as.character(round(c(percentVec)*100,1)),"%",sep="")
> 
> Percent
[1] "9.1%"  "16.7%" "9.1%"  "16.7%" "16.7%" "16.7%"
> percentVec
[1] "9.1%"  "16.7%" "9.1%"  "16.7%" "16.7%" "16.7%"

Thank you so much for replying, I am trying your script, but as requested, I am not able to check the output percentages using vecFun1, kindly help. — Adam Shaw, Jan 13 '18 at 12:40
Hi Adam,the results should be in percentVec variable. Try str(percentVec) or View(percentVec) — Antonios, Jan 13 '18 at 13:07
the percentVec is giving decimal value, the formula which I gave in my question is giving in percentage value. Kindly check. — Adam Shaw, Jan 13 '18 at 14:36
i added one line of code so percentVec is now a character vector with the % notation. I used your names2 variable which is too small to show how faster vectorization is over lapply. — Antonios, Jan 13 '18 at 16:27
please help me with this post, I am struggling to find a fix, https://stackoverflow.com/questions/48394545/usage-of-uioutput-in-multiple-menuitems-in-r-shiny-dashboard/48394758?noredirect=1#comment83778189_48394758 — Adam Shaw, Jan 23 '18 at 06:26

Application of Text Mining in R on large dataset

1 Answers1