It's certainly true that the default JVM heap size will probably have to be increased. I doubt that `split` or any other regex-based approach will be tractable for an input that large. Likewise, converting the input to a `List[Char]` to exploit the wonderful collections library will inflate memory requirements excessively; the blow-up will be at least a decimal order of magnitude.
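For context, here is the regex route being argued against. It works fine at small scale, but on a very large input the intermediate `Array[String]` that `split` materializes (on top of the regex machinery) is exactly what strains the heap. This is a hypothetical illustration, not part of the proposed solution:

```scala
object SplitSketch {
  // Splits on runs of non-letter characters; each call allocates an
  // Array[String] holding every token before the Set is built.
  def words(s: String): Set[String] =
    s.split("""[^\p{L}]+""").filter(_.nonEmpty).toSet
}
```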
Given the relatively simple decomposition (words separated by whitespace or punctuation), I think a more prosaic solution may be necessary: iterate imperatively over the characters of the string (but not via an implicit conversion to any kind of `Seq[Char]`) and find the words, dumping them into a `mutable.Set[String]`. That will eliminate duplicates, for one thing. Perhaps use a `Buffer[Char]` to accumulate the characters of each word before turning them into a `String` to be added to the `Set[String]`.
Here's a cut at it:
```scala
package rrs.scribble

object BigTextNLP {
  def btWords(bt: String): collection.mutable.Set[String] = {
    import java.lang.Character.{isLetter => l}

    val wordBuffer = collection.mutable.Buffer[Char]()
    val wordSet    = collection.mutable.Set[String]()

    var inWord = false                 // starting out-of-word also handles an empty input

    (0 until bt.length) foreach { i =>
      val c  = bt.charAt(i)
      val lc = l(c)
      if (inWord)
        if (lc)
          wordBuffer += c
        else {                         // word just ended: record it and reset the buffer
          wordSet += wordBuffer.mkString
          wordBuffer.clear()
          inWord = false
        }
      else
        if (lc) {                      // word just started
          inWord = true
          wordBuffer += c
        }
    }

    if (inWord)                        // flush the final word if the text ends mid-word
      wordSet += wordBuffer.mkString

    wordSet
  }
}
```
In the REPL:
```
scala> import rrs.scribble.BigTextNLP._
import rrs.scribble.BigTextNLP._

scala> btWords("this is a sentence, maybe!")
res0: scala.collection.mutable.Set[String] = Set(this, maybe, sentence, is, a)
```
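If the `Buffer[Char]` accumulation shows up in profiling, a `java.lang.StringBuilder` avoids boxing each `Char` into a `java.lang.Character`. This is my own variant of the same loop, not part of the answer above, and the object name is made up:

```scala
object BigTextNLP2 {
  import java.lang.Character.isLetter

  def btWords(bt: String): collection.mutable.Set[String] = {
    val sb      = new java.lang.StringBuilder   // accumulates the current word, no per-Char boxing
    val wordSet = collection.mutable.Set[String]()

    var i = 0
    while (i < bt.length) {
      val c = bt.charAt(i)
      if (isLetter(c))
        sb.append(c)
      else if (sb.length > 0) {                 // just left a word: record it and reset
        wordSet += sb.toString
        sb.setLength(0)
      }
      i += 1
    }
    if (sb.length > 0)                          // flush a trailing word
      wordSet += sb.toString

    wordSet
  }
}
```

`setLength(0)` reuses the builder's backing array across words instead of reallocating, which matters over hundreds of millions of characters.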