I have access to a machine with many CPU cores (56, to be exact), and when training models with TensorFlow I would like to make maximum use of them by turning each core into an independent trainer of the model.
In TensorFlow's documentation I found two parameters (inter- and intra-op parallelism) that control the degree of parallelism during training. However, these two parameters do not let me do what I intend.
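For reference, these are the knobs I am referring to; a minimal example of how they are set with the TF 2.x API (in TF 1.x the equivalent fields live on `tf.ConfigProto`):

```python
import tensorflow as tf

# Threads used *within* a single op (e.g., a large matmul).
tf.config.threading.set_intra_op_parallelism_threads(56)

# Number of independent ops that may run concurrently.
tf.config.threading.set_inter_op_parallelism_threads(56)
```

These only control how the graph's ops are threaded; they do not give me one independent training loop per core.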
How can I make each core an independent worker? That is: each batch of samples is sharded across the workers, each worker computes gradients on the shard it was assigned, and finally each worker updates the variables (which are shared by all the workers) according to the gradients it has computed.
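To make the intended scheme concrete, here is a rough sketch of the per-worker step I have in mind (`model`, `loss_fn` and `optimizer` are placeholders for my actual setup); the open question is how to run this step on every core in parallel against the shared variables:

```python
import tensorflow as tf

@tf.function
def worker_step(model, loss_fn, optimizer, shard_x, shard_y):
    # Each worker computes gradients only on the shard of the batch
    # that was assigned to it...
    with tf.GradientTape() as tape:
        loss = loss_fn(shard_y, model(shard_x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # ...and then applies those gradients to the variables that are
    # shared by all workers.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```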