I'm trying to remove stopwords from a big data frame in R (12M rows). I tested the code on a 30k-row data frame and it works well (it finishes within 2 minutes). On a 300k-row data frame it takes far too long (about 4 hours), and I need to run it on a 12M-row data frame. Is there another way to do this? (Maybe the loop causes the slowdown.)

The trait_text function is defined in the code below, and removeWords is a function from the tm package that removes stopwords from a character vector.

Another question in the same context: do I need to migrate to 64-bit RStudio? The 32-bit version is not using all the RAM available on the machine.

#define stopwords
stop<-c("MONSIEUR","MADAME","MR","MME","M","SARL","SA","EARL","EURL","SCI","SAS","ETS","STE","SARLU","SASU","CFA","ATS","GAEC","COMMUNE","SOCIETE",toupper(stopwords::stopwords("fr", source = "snowball")))


##trait text :

#Remove Multiple spaces
del_multispace = function(text) {
  return(text <- gsub("\\s+", " ", text))
}

#Remove punctuation
del_punctuation = function(text) {
  return(text <- gsub("[[:punct:]]", "", text))
}

#Remove accents 
del_accent = function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  text <- gsub("['`^~\"]", "", text)
  return(text)
}


trait_text=function(text) {

  text = del_multispace(text)
  text = del_punctuation(text)
  text = del_accent(text)
  return(text)
}

#remove stopwords for data :
system.time(for (i in 1:nrow(test_data)) {

  print(paste("client n: ",i))
  x<-removeWords(trait_text(test_data$ref[i]),stop)


  #output
  test_data$ref[i]<-gdata::trim(paste(x, collapse = ' '))

})

Sample test_data with desired output :


  | ref         | output
1 | "LE LA ONE" | "ONE"
2 | "SAS TWO"   | "TWO"
3 | "MR THREE"  | "THREE"
Amine96
  • Make the question reproducible: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . Also: what packages are the functions trait_text and removeWords from? –  Jun 03 '19 at 12:36
  • Also, 32-bit RStudio can run a 64-bit R session. RStudio is just the front-end. It is not doing the heavy lifting. When you start RStudio, it will tell you whether you are using 32- or 64-bit R in the text at the top of the console. – Andrew Jun 03 '19 at 12:48
  • @schwantke I updated the post for trait_text and removeWords – Amine96 Jun 03 '19 at 12:50
  • You do not need to loop anything with `gsub`. Simply run it on the entire data frame column. Would help to see `removeWords()` as it too can be vectorized. – Parfait Jun 03 '19 at 12:54
  • @Amine96: You did not provide a sample of test_data –  Jun 03 '19 at 13:04
  • @schwantke you can find it in the code section – Amine96 Jun 03 '19 at 13:39
  • @Parfait I can't use gsub with a stopword's vector – Amine96 Jun 03 '19 at 13:40
  • @Amine96: running the code gives me "object 'test_data' not found" –  Jun 03 '19 at 14:14
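
On the 32-bit versus 64-bit point raised in the comments, a minimal sketch of how the running R session could be inspected from the console. It uses only base R; the variable names (arch, pointer_bytes, is_64bit) are illustrative and not part of the original post.

```r
# Architecture string reported by the running R session (e.g. "x86_64" or "i386").
arch <- R.version$arch

# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
pointer_bytes <- .Machine$sizeof.pointer
is_64bit <- pointer_bytes == 8L

cat("R architecture:", arch, "\n")
cat("64-bit session:", is_64bit, "\n")
```

If this reports a 32-bit session, the R build selected by RStudio is the thing to change, since RStudio itself is only the front-end.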

1 Answer

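I figured out a solution to my question that avoids the loop entirely: gsub, iconv and removeWords all accept a whole character vector, so the cleaning functions can be applied to the full column in a single call.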

Code below :


library(tm)
library(gdata)


#stopwords
stop<-c("MONSIEUR","MADAME","MR","MME","M","SARL","SA","EARL","EURL","SCI","SAS","ETS","STE","SARLU","SASU","CFA","ATS","GAEC","COMMUNE","SOCIETE",toupper(stopwords::stopwords("fr", source = "snowball")))


#Remove multiple spaces
del_multispace = function(text) {
  return(text <- gsub("\\s+", " ", text))
}

#Remove punctuation 
del_punctuation = function(text) {
  return(text <- gsub("[[:punct:]]", "", text))
}

#Remove accents
del_accent = function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  text <- gsub("['`^~\"]", "", text)
  return(text)
}

#remove stopwords from text
del_stopwords=function(text) {

  text<-removeWords(text,stop)
  return(text)
}


#Cleaning function :
trait_text=function(text) {

  text = del_multispace(text)
  text = del_punctuation(text)
  text = del_accent(text)
  text = del_stopwords(text)
  return(text)
}


#remove stopwords from test_data:

system.time(test_data$x<-trim(trait_text(test_data$ref)))
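
As a side note on the comment about gsub and a stopword vector: a stopword vector can be collapsed into one regular expression, after which gsub runs over the whole column at once. The following is a minimal base-R sketch of that idea, not the code used above; stop_demo and ref_demo are made-up stand-ins for the real stopword list and for test_data$ref.

```r
# Hypothetical miniature inputs, for illustration only.
stop_demo <- c("LE", "LA", "SAS", "MR")
ref_demo  <- c("LE LA ONE", "SAS TWO", "MR THREE")

# Collapse the stopword vector into a single alternation,
# anchored on word boundaries so that only whole words match.
pattern <- paste0("\\b(", paste(stop_demo, collapse = "|"), ")\\b")

# gsub() is vectorized over the input text, so no loop is needed.
cleaned <- gsub(pattern, "", ref_demo, perl = TRUE)

# Squeeze the spaces left behind by removed words and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

Squeezing spaces after the stopwords are removed, rather than before, avoids double spaces where a stopword sat in the middle of a string.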
Amine96