
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.

Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but, more worrisome, incoherent (topics/predictions). If a document is predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
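
For context, the wrapper is roughly of this shape (a minimal sketch, not the actual skeleton code linked above: the class name and the `vocabulary` map are illustrative, and it assumes the trained `DistributedLDAModel` has been converted with `toLocal()` so that the single-document `topicDistribution(Vector)` from SPARK-10809 is available):

```java
import java.util.Map;
import java.util.TreeMap;

import org.apache.spark.mllib.clustering.LocalLDAModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class LdaPredictor {
  private final LocalLDAModel model;             // DistributedLDAModel.toLocal(), or an online-trained model
  private final Map<String, Integer> vocabulary; // same term -> column index mapping used at training time

  public LdaPredictor(LocalLDAModel model, Map<String, Integer> vocabulary) {
    this.model = model;
    this.vocabulary = vocabulary;
  }

  /** Convert a tokenized document into a sparse term-count vector over the training vocabulary. */
  public Vector toTermCounts(String[] tokens) {
    TreeMap<Integer, Double> counts = new TreeMap<Integer, Double>();
    for (String token : tokens) {
      Integer index = vocabulary.get(token);
      if (index == null) {
        continue;                                // terms not seen at training time are dropped
      }
      Double current = counts.get(index);
      counts.put(index, current == null ? 1.0 : current + 1.0);
    }
    int[] indices = new int[counts.size()];
    double[] values = new double[counts.size()];
    int i = 0;
    for (Map.Entry<Integer, Double> entry : counts.entrySet()) {
      indices[i] = entry.getKey();
      values[i] = entry.getValue();
      i++;
    }
    return Vectors.sparse(vocabulary.size(), indices, values);
  }

  /** Per-topic weights for a single unseen document (the SPARK-10809 routine). */
  public Vector predict(String[] tokens) {
    return model.topicDistribution(toTermCounts(tokens));
  }
}
```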

Not being able to tell if something's wrong in my prediction code - or if it's because I was using the Distributed/EM-based model (as hinted at by jasonl here) - I decided to try the newer Local/Online model. I spent a couple of days tuning my 240 core/768GB RAM 3-node cluster to no avail; seemingly no matter what I try, I run out of memory attempting to build a model this way.
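
For the Online attempt, the training call is roughly the following (a minimal sketch, assuming `corpus` is the same `JavaPairRDD<Long, Vector>` of document-id/term-count pairs used for the EM run; the class name is illustrative):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LocalLDAModel;
import org.apache.spark.mllib.linalg.Vector;

public class OnlineLdaTrainer {
  public static LocalLDAModel train(JavaPairRDD<Long, Vector> corpus) {
    LDA lda = new LDA()
        .setK(250)               // same topic count as the EM run above
        .setMaxIterations(100)
        .setOptimizer("online"); // OnlineLDAOptimizer -> LocalLDAModel
    // With "em" the result would be a DistributedLDAModel that needs toLocal()
    // before single-document prediction; with "online" it is already local.
    return (LocalLDAModel) lda.run(corpus);
  }
}
```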

I tried various settings for the following (a configuration sketch follows the list):

  • driver-memory (8G)
  • executor-memory (1-225G)
  • spark.driver.maxResultSize (including disabling it)
  • spark.memory.offHeap.enabled (true/false)
  • spark.broadcast.blockSize (currently at 8m)
  • spark.rdd.compress (currently true)
  • changing the serializer (currently Kryo) and its max buffer (512m)
  • increasing various timeouts to allow for longer computation (executor.heartbeatInterval, rpc.ask/lookupTimeout, spark.network.timeout)
  • spark.akka.frameSize (1024)
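
For completeness, the configuration I've been iterating on looks roughly like this (a sketch only: the values shown are from one attempt, not a known-good combination, and driver/executor memory were actually passed to spark-submit rather than set in code):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LdaJobContext {
  public static JavaSparkContext build() {
    SparkConf conf = new SparkConf()
        .setAppName("lda-online-build")
        .set("spark.driver.maxResultSize", "0")        // 0 disables the limit
        .set("spark.memory.offHeap.enabled", "false")
        .set("spark.broadcast.blockSize", "8m")
        .set("spark.rdd.compress", "true")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.max", "512m")
        .set("spark.executor.heartbeatInterval", "120s")
        .set("spark.rpc.askTimeout", "600s")
        .set("spark.rpc.lookupTimeout", "600s")
        .set("spark.network.timeout", "600s")
        .set("spark.akka.frameSize", "1024");          // MB
    return new JavaSparkContext(conf);
  }
}
```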

Depending on the settings, the failure oscillates between a JVM core dump due to off-heap allocation errors ("Native memory allocation (mmap) failed to map X bytes for committing reserved memory") and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.

Questions:

  1. Does my prediction routine look OK? Could an off-by-one error somewhere explain the irrelevant predicted topics?
  2. Do I stand a chance of building a model of the size described above with Spark? Yahoo! LDA manages it with modest RAM requirements.

Any pointers as to what I can try next would be much appreciated!

  • My guess is that the `OnlineLDAOptimizer` is not going to scale well with large dictionaries and larger numbers of topics (a consequence of the way the sufficient statistics are being aggregated.) Note that in the databricks blog post you referenced they are actually using the `EMLDAOptimizer`, which is quite different (for one, `EMLDAOptimizer` uses GraphX while `OnlineLDAOptimizer` is just based on Spark core.) – Jason Scott Lenderman Feb 01 '16 at 04:46
  • Databricks published a blog post showing how the `OnlineLDAOptimizer` was applied to Wikipedia using 100 topics and a 10k-word dictionary, but this is pretty small compared with what you're trying to do. – Jason Scott Lenderman Feb 01 '16 at 04:57
