There is nothing specifically wrong with your code when it comes to concurrency:
- The distributed and parallel part is the model fitting process, `ml_kmeans(...)`. The for loop doesn't affect that: each model will be trained using the resources available on your cluster, as expected.
- The outer loop is driver code. Under normal conditions we use standard single-threaded code (not that multithreading at this level is really an option in R) when working with Spark; see the sketch after this list.
  In general (Scala, Python, Java) it is possible to use separate threads to submit multiple Spark jobs at the same time, but in practice it requires a lot of tuning and access to low-level APIs. Even then it is rarely worth the fuss, unless you have a significantly overpowered cluster at your disposal.
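To illustrate, a minimal sketch of such a single-threaded driver loop, reusing the `ml_kmeans` call from the snippet below (`id_ip4` is your Spark data frame; the range `2:10` is an arbitrary choice for illustration):

```r
library(sparklyr)

# Plain single-threaded driver loop: each ml_kmeans() call still runs
# as a distributed Spark job, so the loop itself is not the bottleneck.
models <- lapply(2:10, function(i) {
  ml_kmeans(id_ip4, centers = i, seed = 1234,
            features_col = colnames(id_ip4))
})
```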
That being said, please keep in mind that if you compare Spark K-Means to local implementations on data that fits in memory, things will be relatively slow. Using randomized initialization might help speed things up:
```r
ml_kmeans(id_ip4, centers = i, init_mode = "random",
          seed = 1234, features_col = colnames(id_ip4))
```
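If the loop is meant to produce an elbow curve, you can pull the within-set sum of squared errors from each fitted model. A sketch, assuming the `models` list from above and that `ml_compute_cost()` is available in your sparklyr / Spark version (it is deprecated as of Spark 3.0):

```r
# Within-set sum of squared errors per model, e.g. to choose the
# number of clusters with the elbow method.
wssse <- sapply(models, function(model) ml_compute_cost(model, id_ip4))
```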
On a side note, with algorithms that can be easily evaluated with one of the available evaluators (`ml_binary_classification_evaluator`, `ml_multiclass_classification_evaluator`, `ml_regression_evaluator`) you can use `ml_cross_validator` / `ml_train_validation_split` instead of manual loops (see for example How to train a ML model in sparklyr and predict new values on another dataframe?).