
I've been considering scikit-learn and Spark for an ML project where I need to classify words into two categories.

I run Spark with local[*], and the session is created in Java.
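For context, a minimal sketch of that setup (the class and application names are placeholders, not my actual code):

    import org.apache.spark.sql.SparkSession;

    public class WordClassifierApp {                    // placeholder name
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("word-classifier")         // placeholder name
                    .master("local[*]")                 // use all local cores
                    .getOrCreate();
            // ... fit/load the model and classify words here ...
            spark.stop();
        }
    }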

I am surprised at how fast scikit-learn is compared to Spark running locally on small input batches. Spark scales better: it takes roughly the same time to label 1 word as 100, but scikit-learn is still faster on small datasets.
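To illustrate that fixed per-call overhead, scoring a single word goes through something like the sketch below (simplified; it assumes an existing SparkSession spark and an already-fitted PipelineModel model, and the column names are illustrative):

    import java.util.Collections;

    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Label one word with an already-fitted PipelineModel. Building the
    // one-row DataFrame and scheduling the Spark job is a fixed cost,
    // which is why labeling 1 word takes about as long as labeling 100.
    StructType schema = DataTypes.createStructType(Collections.singletonList(
            DataTypes.createStructField("word", DataTypes.StringType, false)));
    Dataset<Row> batch = spark.createDataFrame(
            Collections.singletonList(RowFactory.create("example")), schema);
    Dataset<Row> labeled = model.transform(batch);  // runs a full Spark job
    labeled.select("word", "prediction").show();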

Is there any way to tune Spark so it performs better on small input datasets? I cannot buffer incoming words until I have a large enough batch.

Thanks.

Raúl García
  • Tuning Spark is generally geared toward what you are doing in the code you are running. A lot of information can be found in the Spark UI. Is there a lot of shuffling? Do you have too many or too few partitions? Is there a particular stage that is slow? Unfortunately, there usually isn't just one thing that can be done to speed up Spark. Also keep in mind that Spark shines on huge datasets, not necessarily on speed with smaller ones (see the config sketch after these comments). Look here for information on tuning Spark: http://spark.apache.org/docs/latest/tuning.html – Jeremy May 16 '17 at 17:24
  • Spark is not good for small datasets. I suggest you compare Spark vs scikit-learn with a 10 GB dataset and see what happens. – Thiago Baldim May 16 '17 at 19:10
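
For reference, the kind of knob Jeremy's comment points at looks like the sketch below (the values are illustrative, not recommendations; fewer shuffle partitions mainly cuts task-scheduling overhead when the data fits in a single partition anyway):

    import org.apache.spark.sql.SparkSession;

    // Small-input tuning sketch: lower the partition counts so each tiny
    // job schedules a handful of tasks instead of hundreds.
    SparkSession spark = SparkSession.builder()
            .appName("word-classifier")                   // placeholder name
            .master("local[*]")
            .config("spark.sql.shuffle.partitions", "1")  // default is 200
            .config("spark.default.parallelism", "1")
            .getOrCreate();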

0 Answers