The R script below computes the percentage similarity between two character vectors, "names1" and "names2". However, I need to run the same operation on columns of 6,000-10,000+ items. At that scale the all-pairs comparison produces millions of rows, which is unworkable for enterprise delivery. On top of the "percent" column I also need to add 6-7 further columns, which pushes the output above 1 GB. Could you please help me update the script, or suggest another way to achieve this? Thanks a lot.
library(RecordLinkage)  # provides levenshteinSim(); the other packages were unused

names1 <- c("Adam Shaw", "Justin Bose", "Cydney Clide")
names2 <- c("Adam Shaw", "Justin Bose", "Cydney Clide")

# Compare each names1[x] against every names2 entry except the one at the
# same index. This produces n * (n - 1) comparisons, which is what explodes
# into millions of rows for large columns.
Percent <- paste0(round(unlist(lapply(seq_along(names1), function(x) {
  levenshteinSim(names1[x], names2[-x])
})) * 100, 1), "%")
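One common way to tame the n * (n - 1) blow-up is *blocking*: only compare pairs that share a cheap key, so most of the quadratic pair space is never materialised. The sketch below is a minimal illustration, assuming the lowercased first letter is an acceptable blocking key for your data (you would normally pick something stronger, e.g. a phonetic code). It uses only base R's `adist()`, whose generalized Levenshtein distance, divided by the longer string's length, gives the same similarity definition as `levenshteinSim()`; the example `names2` values are made up for demonstration.

```r
# Sketch: blocked cross-comparison of names1 vs names2 using base R only.
names1 <- c("Adam Shaw", "Justin Bose", "Cydney Clide")
names2 <- c("Adam Shore", "Justin Rose", "Cydney Clide")  # hypothetical data

# Blocking key: lowercased first letter (an assumption, swap in your own).
key1 <- tolower(substr(names1, 1, 1))
key2 <- tolower(substr(names2, 1, 1))

# Build only the candidate pairs whose keys match, block by block.
pairs <- do.call(rbind, lapply(intersect(key1, key2), function(k) {
  expand.grid(i = which(key1 == k), j = which(key2 == k))
}))

# Levenshtein similarity = 1 - distance / length of the longer string,
# matching RecordLinkage::levenshteinSim's definition.
d   <- mapply(function(i, j) adist(names1[i], names2[j]), pairs$i, pairs$j)
len <- pmax(nchar(names1[pairs$i]), nchar(names2[pairs$j]))
pairs$percent <- paste0(round((1 - d / len) * 100, 1), "%")
```

With a reasonable key, each block stays small, so memory grows roughly with the number of true candidate pairs rather than with n squared; keeping the similarity as a numeric column instead of a "%" string would also shrink the extra 6-7 columns you plan to add.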