I'm trying to remove stopwords from a big data frame in R (12M rows). I tested the code on a 30k-row data frame and it works perfectly (it finishes within 2 minutes), but on a 300k-row data frame it takes far too long (about 4 hours), and I need to run it on the 12M-row data frame. I'd like to know if there's a better way to do this (maybe the loop is causing the slowdown).
The trait_text function is defined in the code below, and removeWords is an existing R function (from the tm package) that removes stopwords from a character vector.
Another question in the same context: do I need to migrate to 64-bit R? The 32-bit version does not use all the RAM available on the machine.
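Note on the trimming step below: removeWords replaces matched words with empty strings but keeps the surrounding spaces, which is why the result has to be trimmed afterwards. For example:

library(tm)
removeWords("LE LA ONE", c("LE", "LA"))
# [1] "  ONE"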
# Define stopwords (legal-form abbreviations + French snowball stopwords)
library(tm)  # for removeWords()

stop <- c("MONSIEUR", "MADAME", "MR", "MME", "M", "SARL", "SA", "EARL",
          "EURL", "SCI", "SAS", "ETS", "STE", "SARLU", "SASU", "CFA",
          "ATS", "GAEC", "COMMUNE", "SOCIETE",
          toupper(stopwords::stopwords("fr", source = "snowball")))
## trait_text: text-cleaning helpers

# Collapse multiple spaces into one
del_multispace <- function(text) {
  gsub("\\s+", " ", text)
}

# Remove punctuation
del_punctuation <- function(text) {
  gsub("[[:punct:]]", "", text)
}

# Remove accents (transliterate to plain ASCII, then drop leftover marks)
del_accent <- function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  gsub("['`^~\"]", "", text)
}

# Apply all cleaning steps in order
trait_text <- function(text) {
  text <- del_multispace(text)
  text <- del_punctuation(text)
  del_accent(text)
}
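A quick sanity check of the cleaning helpers on a simple input:

trait_text("MR.   DUPONT")
# [1] "MR DUPONT"   (punctuation removed, spaces collapsed)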
# Remove stopwords row by row (this loop is the slow part)
system.time(for (i in 1:nrow(test_data)) {
  print(paste("client n:", i))
  x <- removeWords(trait_text(test_data$ref[i]), stop)
  # removeWords leaves the surrounding spaces, so trim the result
  test_data$ref[i] <- gdata::trim(paste(x, collapse = " "))
})
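Since gsub, iconv, and removeWords all operate on whole character vectors, I suspect the row-by-row loop (and the print inside it) is what makes this so slow. Would a vectorized call along these lines be the right direction? (trimws is base R, used here in place of gdata::trim)

# Tentative vectorized version: clean and filter the whole column in one pass
test_data$ref <- trimws(removeWords(trait_text(test_data$ref), stop))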
Sample test_data with desired output:

  ref           output
1 "LE LA ONE"   "ONE"
2 "SAS TWO"     "TWO"
3 "MR THREE"    "THREE"