It's certainly true that the default JVM heap size will probably have to be increased. I doubt that `split` or any other regex-based approach will be tractable for an input that large. Likewise, converting the input to a `List[Char]` to exploit the wonderful collections library will inflate memory requirements excessively; the blow-up will be at least a decimal order of magnitude.
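For context, here is the regex route being argued against. It works fine at small scale, but on a very large input the intermediate `Array[String]` that `split` materializes (on top of the regex machinery) is exactly what strains the heap. This is a hypothetical illustration, not part of the proposed solution:

```scala
object SplitSketch {
  // Splits on runs of non-letter characters; each call allocates an
  // Array[String] holding every token before the Set is built.
  def words(s: String): Set[String] =
    s.split("""[^\p{L}]+""").filter(_.nonEmpty).toSet
}
```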
Given the relatively simple decomposition (words separated by whitespace or punctuation), I think a more prosaic solution may be necessary: iterate imperatively over the characters of the string (but not via an implicit conversion to any kind of `Seq[Char]`) and find the words, dumping them into a `mutable.Set[String]`. That will eliminate duplicates, for one thing. Perhaps use a `Buffer[Char]` to accumulate the characters of each word before turning them into a `String` to be added to the `Set[String]`.
Here's a cut at it:
```scala
package rrs.scribble

object BigTextNLP {
  def btWords(bt: String): collection.mutable.Set[String] = {
    import java.lang.Character.{isLetter => l}

    val wordBuffer = collection.mutable.Buffer[Char]()
    val wordSet    = collection.mutable.Set[String]()

    var inWord = false                 // starting out-of-word also handles an empty input

    (0 until bt.length) foreach { i =>
      val c  = bt.charAt(i)
      val lc = l(c)
      if (inWord)
        if (lc)
          wordBuffer += c
        else {                         // word just ended: record it and reset the buffer
          wordSet += wordBuffer.mkString
          wordBuffer.clear()
          inWord = false
        }
      else
        if (lc) {                      // word just started
          inWord = true
          wordBuffer += c
        }
    }

    if (inWord)                        // flush the final word if the text ends mid-word
      wordSet += wordBuffer.mkString

    wordSet
  }
}
```
In the REPL:
```
scala> import rrs.scribble.BigTextNLP._
import rrs.scribble.BigTextNLP._

scala> btWords("this is a sentence, maybe!")
res0: scala.collection.mutable.Set[String] = Set(this, maybe, sentence, is, a)
```
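If the `Buffer[Char]` accumulation shows up in profiling, a `java.lang.StringBuilder` avoids boxing each `Char` into a `java.lang.Character`. This is my own variant of the same loop, not part of the answer above, and the object name is made up:

```scala
object BigTextNLP2 {
  import java.lang.Character.isLetter

  def btWords(bt: String): collection.mutable.Set[String] = {
    val sb      = new java.lang.StringBuilder   // accumulates the current word, no per-Char boxing
    val wordSet = collection.mutable.Set[String]()

    var i = 0
    while (i < bt.length) {
      val c = bt.charAt(i)
      if (isLetter(c))
        sb.append(c)
      else if (sb.length > 0) {                 // just left a word: record it and reset
        wordSet += sb.toString
        sb.setLength(0)
      }
      i += 1
    }
    if (sb.length > 0)                          // flush a trailing word
      wordSet += sb.toString

    wordSet
  }
}
```

`setLength(0)` reuses the builder's backing array across words instead of reallocating, which matters over hundreds of millions of characters.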