I've been considering scikit-learn and Spark for an ML project where I need to classify words into two categories.
I run Spark with local[*], and the session is created in Java.
I am surprised at how fast scikit-learn is compared to Spark running locally on small input batches. Spark scales better: it takes roughly the same time to label 1 word as it does to label 100. Even so, scikit-learn is still faster for small datasets.
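
For reference, here is a minimal sketch of how a per-batch comparison like this could be timed in plain Java. It is not my actual code: `BatchTiming`, `averageMillis` and `stubClassify` are hypothetical names, and `stubClassify` is only a placeholder for the real Spark or scikit-learn call. Warm-up calls are excluded so that JIT compilation and lazy initialisation do not distort the numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class BatchTiming {

    // Average wall-clock time in milliseconds for one call to `classify` on `words`.
    // The warm-up calls run first and are not timed.
    static double averageMillis(Function<List<String>, List<Integer>> classify,
                                List<String> words, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            classify.apply(words);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            classify.apply(words);
        }
        return (System.nanoTime() - start) / 1_000_000.0 / timedRuns;
    }

    // Hypothetical placeholder classifier: label 1 if the word has more
    // than 4 characters, else 0. Swap in the real model call here.
    static List<Integer> stubClassify(List<String> words) {
        List<Integer> labels = new ArrayList<>(words.size());
        for (String w : words) {
            labels.add(w.length() > 4 ? 1 : 0);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> one = Collections.nCopies(1, "example");
        List<String> hundred = Collections.nCopies(100, "example");
        System.out.printf("1 word:    %.4f ms per batch%n",
                averageMillis(BatchTiming::stubClassify, one, 20, 100));
        System.out.printf("100 words: %.4f ms per batch%n",
                averageMillis(BatchTiming::stubClassify, hundred, 20, 100));
    }
}
```

If the time for 1 word and for 100 words comes out nearly equal, the cost is dominated by fixed per-call overhead rather than per-word work, which is the pattern I see with Spark.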
Is there any way to tune Spark so that it performs better on small inputs? I cannot buffer incoming words until I have a large enough batch.
Thanks.