
I want to know an efficient way to remove stop words from a huge text corpus. Currently my approach is to convert the stop words into a regex, match the lines of text against the regex, and remove the matches.

e.g.:

String regex = "\\b(?:a|an|the|was|i)\\b\\s*";
String line = "hi this is regex approach of stop word removal";
String lineWithoutStopword = line.replaceAll(regex, "");
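
One caveat with this snippet: String.replaceAll recompiles the regex on every call, so over a huge corpus it is cheaper to compile the pattern once and reuse it across lines. A minimal sketch of that (in Scala, using the same java.util.regex API):

import java.util.regex.Pattern

// Compile the stop word pattern once, then reuse it for every line
val stopWordPattern = Pattern.compile("\\b(?:a|an|the|was|i)\\b\\s*")
val line = "hi this is regex approach of stop word removal"
val lineWithoutStopword = stopWordPattern.matcher(line).replaceAll("")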

Is there any other, more efficient approach to removing stop words from a huge corpus?

thanks

nat
    With a "huge" text file, the speed of processing is going to be mostly determined by how quickly you can read and process the file. Tweaking the regex is unlikely to make any significant difference. To check, copy the input file to an output file without any processing and see how long it takes. You aren't going to be able to process the file faster than that. – rossum Apr 11 '15 at 11:53
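
A minimal sketch of that baseline check (file names are hypothetical):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Time a plain copy of the corpus; no processing can run faster than this
val start = System.nanoTime()
Files.copy(Paths.get("huge.txt"), Paths.get("huge-copy.txt"),
  StandardCopyOption.REPLACE_EXISTING)
println(s"Plain copy took ${(System.nanoTime() - start) / 1000000} ms")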

1 Answer


Using Spark, one way would be to subtract the stop words from the text after it has been tokenized into words.

// Load the corpus and the stop word list as RDDs
val text = sc.textFile("huge.txt")
val stopWords = sc.textFile("stopwords.txt")
// Tokenize into words ("\\W+" avoids empty tokens) and subtract the stop words
val words = text.flatMap(line => line.split("\\W+"))
val clean = words.subtract(stopWords)

If you need to process very large text files (well beyond a few GB), it will be more efficient to treat the set of stop words as an in-memory structure that can be broadcast to each worker.

The code would change like this:

// Collect the stop words to the driver and broadcast the set to all workers
val stopWords = sc.textFile("stopwords.txt")
val stopWordSet = stopWords.collect.toSet
val stopWordSetBC = sc.broadcast(stopWordSet)
val words = text.flatMap(line => line.split("\\W+"))
// Filter each partition against the local copy of the broadcast set
val clean = words.mapPartitions { iter =>
  val stopWordSet = stopWordSetBC.value
  iter.filter(word => !stopWordSet.contains(word))
}

Note that normalizing the words of the original text (for example, lowercasing them so they match the stop word list) will be necessary for this to work properly.
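
A minimal sketch of that normalization, assuming the stop word list is lowercase:

// Lowercase each token and drop empty ones so they match the stop word list
val words = text.flatMap(line => line.split("\\W+"))
  .map(_.toLowerCase)
  .filter(_.nonEmpty)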

maasg