Questions tagged [distributed-training]
83 questions
4
votes
0 answers
Issues when using HuggingFace `accelerate` with `fp16`
I'm trying to use the accelerate module to parallelize my model training, but I have trouble using it when training models with fp16. If I load the model with torch_dtype=torch.float16, I get ValueError: Attempting to unscale FP16 gradients. But if I…
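A possible direction, as a minimal sketch only: keep the master weights in fp32 and let Accelerator manage the fp16 autocast and gradient scaling, instead of loading the model with torch_dtype=torch.float16. The model name and optimizer below are placeholders, not the asker's setup.

# Sketch: fp32 weights + accelerate-managed mixed precision (placeholder model)
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator(mixed_precision="fp16")      # grad scaler + autocast handled here
model = AutoModelForCausalLM.from_pretrained("gpt2")   # loaded in fp32 on purpose
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model, optimizer = accelerator.prepare(model, optimizer)
# in the training loop: accelerator.backward(loss); optimizer.step(); optimizer.zero_grad()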

weiqis
- 41
- 1
- 3
3
votes
1 answer
On batch size, epochs, and learning rate of DistributedDataParallel
I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the…
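For illustration only, here is the arithmetic behind one common convention (the linear-scaling rule); the per-GPU batch size and base learning rate are assumptions, not values from the question.

# Illustrative numbers: 100 * 8 = 800 images, 8 DDP processes
dataset_size = 800
world_size = 8
per_gpu_batch = 10          # assumption
base_lr = 0.1               # LR tuned for a single GPU with batch 10 (assumption)

global_batch = per_gpu_batch * world_size               # 80 samples per optimizer step
steps_per_epoch = dataset_size // global_batch          # 10 steps, each process sees its shard
scaled_lr = base_lr * world_size                        # 0.8 under linear scaling (one convention)

print(global_batch, steps_per_epoch, scaled_lr)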

Simon
- 5,070
- 5
- 33
- 59
3
votes
1 answer
tf.data vs tf.keras.preprocessing.image.ImageDataGenerator
I was reading about different techniques to load large datasets efficiently. tf.data seems to perform well compared to tf.keras.preprocessing.image.ImageDataGenerator.
From what I know, tf.data uses CPU pipelining to efficiently load the data…
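For comparison, a minimal sketch of the kind of tf.data pipeline being discussed, assuming JPEG files on disk; the file pattern, image size, and batch size are placeholders.

# Minimal tf.data input pipeline sketch (placeholders throughout)
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def load_image(path):
    # decode and resize one image on the CPU
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (224, 224)) / 255.0

ds = (tf.data.Dataset.list_files("images/*.jpg")
        .map(load_image, num_parallel_calls=AUTOTUNE)   # parallel CPU decoding
        .batch(32)
        .prefetch(AUTOTUNE))                            # overlap input with training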

superduper
- 401
- 1
- 5
- 16
3
votes
1 answer
Distributed training over local GPU and Colab GPU
I want to fine-tune ALBERT.
I see that one can distribute neural-network training over multiple GPUs using TensorFlow: https://www.tensorflow.org/guide/distributed_training
I was wondering if it's possible to distribute fine-tuning across both my laptop's…
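Mechanically, TensorFlow's MultiWorkerMirroredStrategy treats any mutually reachable machines as workers via TF_CONFIG; whether a laptop and a Colab VM can actually reach each other over the network is the real obstacle. The sketch below only shows the mechanism, with placeholder addresses, not a verified laptop-plus-Colab recipe.

# Sketch of multi-worker setup; addresses/ports are placeholders
import json, os
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["laptop.example:12345", "colab.example:12345"]},
    "task": {"type": "worker", "index": 0},   # index 1 on the other machine
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")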

Gog
- 93
- 6
2
votes
0 answers
Distributed training with torchrun on 3 nodes: connection timeout
I have a problem running distributed PyTorch training with torchrun. First of all, this is the script I am trying to run:
import torch
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
import…
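The script is cut off, so here is only a hedged sketch of the usual setup: init_process_group reads the environment that torchrun provides, with an explicitly widened timeout; the launch command and rendezvous endpoint in the comment are placeholders.

# Launch on every node (placeholder endpoint and node_rank):
#   torchrun --nnodes=3 --nproc_per_node=1 --node_rank=<0|1|2> \
#            --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29500 train.py
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                            # "gloo" if NCCL/GPUs are unavailable
    timeout=datetime.timedelta(minutes=30),    # widen the rendezvous/collective timeout
)
rank = dist.get_rank()
world_size = dist.get_world_size()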

Morteza
- 46
- 2
2
votes
0 answers
How to build custom model using tf.keras on TensorFlow 2.x that supports SageMaker distributed training?
How can I create custom models built with tf.keras on TensorFlow 2.x that support distributed training (multiple GPU instances) in Amazon SageMaker, e.g. using the Distributed Data Parallel Library (DDPL)?
The documentation mentions that tf.keras is not…

juvchan
- 6,113
- 2
- 22
- 35
2
votes
0 answers
`steps_per_epoch` in google ai platform multi-worker distributed training
I'm training a model with tensorflow==2.7.0 distributed on Google Cloud AI Platform.
I'm using the ParameterServerStrategy with multiple workers.
One thing I'm confused about, and couldn't find an answer to, is how to properly set the number of steps each worker…
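As an illustration of one common convention (not the only one): steps_per_epoch is derived from the global batch size across all workers, not the per-worker batch. The numbers below are placeholders.

# Illustrative arithmetic only
dataset_size = 1_000_000
per_worker_batch = 64
num_workers = 4

global_batch = per_worker_batch * num_workers       # 256 samples per step across the cluster
steps_per_epoch = dataset_size // global_batch      # 3906

print(steps_per_epoch)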

govordovsky
- 359
- 2
- 17
2
votes
0 answers
Getting ProcessExitedException. How to spawn multiple processes on a Databricks notebook using torch.multiprocessing?
I am trying out distributed training in PyTorch using the "DistributedDataParallel" strategy on Databricks notebooks (or any notebook environment), but I am stuck with multi-processing in the Databricks notebook environment.
Problem: I want to spawn…
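For reference, a minimal sketch of the torch.multiprocessing.spawn pattern that usually raises ProcessExitedException when a worker dies; whether it runs inside a Databricks notebook depends on the environment (the worker often has to be importable from a module rather than defined inline), so treat this as illustration only.

# Minimal spawn sketch (illustration only)
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)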

sarjit07
- 7,511
- 1
- 17
- 15
2
votes
0 answers
MirroredVariable has different values on replicas (zeros, except on one device)
Minimal example to demonstrate the problem:
import tensorflow as tf
with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))
Output on a 4-GPU server:
INFO:tensorflow:Using MirroredStrategy with devices…

isarandi
- 3,120
- 25
- 35
2
votes
1 answer
Is there any way to get global ranks from Pytorch distributed (nccl) group?
Suppose we have a PyTorch distributed group object that was initialized by torch.distributed.new_group([a,b,c,d]); is there any way to get the global ranks a,b,c,d back from this group?
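A sketch assuming a recent PyTorch release where a helper for exactly this exists; on older versions the mapping has to be recorded manually when the group is created.

# Assumes torch.distributed.get_process_group_ranks is available (recent PyTorch)
import torch.distributed as dist

group = dist.new_group([0, 2, 5, 7])              # a, b, c, d as global ranks
global_ranks = dist.get_process_group_ranks(group)
print(global_ranks)                               # -> [0, 2, 5, 7]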

Qin Heyang
- 1,456
- 1
- 16
- 18
2
votes
1 answer
Is there a way to train an ML model on multiple laptops?
I have two laptops and want to use both for DL model training. I don't have any experience with distributed systems and want to know whether it is possible to use the processing power of two laptops together to train a single model. What about…
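In principle yes: PyTorch DDP over the gloo backend can span two machines on the same network. A minimal sketch, with a placeholder IP address and one process per laptop; not a tuned or verified recipe.

# Two laptops on the same LAN, one process each, gloo backend (sketch only)
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://192.168.1.10:29500",   # placeholder: first laptop's IP
        rank=rank,
        world_size=world_size,
    )

setup(rank=0, world_size=2)                        # run with rank=1 on the other laptop
model = DDP(torch.nn.Linear(10, 1))                # placeholder model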

superduper
- 401
- 1
- 5
- 16
2
votes
1 answer
A simple distributed training Python program for deep learning models with Horovod on a GPU cluster
I am trying to run some example Python 3 code
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on databricks GPU cluster (with 1 driver and 2 workers).
Databricks environment:
ML 6.6, scala 2.11, Spark…
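For context, a pared-down sketch of the HorovodRunner launch pattern used in the linked Databricks docs, assuming that API; the training function body is omitted and np=2 matches the two workers.

# HorovodRunner launch pattern (sketch; body of train_hvd omitted)
import horovod.tensorflow.keras as hvd
from sparkdl import HorovodRunner

def train_hvd():
    hvd.init()
    # ... build a tf.keras model, wrap its optimizer with
    # hvd.DistributedOptimizer(...), then call model.fit(...) ...

hr = HorovodRunner(np=2)   # distribute onto 2 worker GPUs
hr.run(train_hvd)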

user3448011
- 1,469
- 1
- 17
- 39
1
vote
0 answers
How to set max GPU memory use for each device when using DeepSpeed for distributed training?
I am new to DeepSpeed and have some experience in deep learning. I want to know how to set the maximum GPU memory to use for each device when using DeepSpeed.
I haven't tried anything yet and have no ideas so far.
My GPU has about 46G of memory, and I want to run long…
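As far as I know DeepSpeed itself does not expose a single "max GPU memory" knob; one workaround is PyTorch's per-process memory cap, with the rest handled by ZeRO offload settings in the DeepSpeed JSON config. The fraction and device index below are assumptions.

# Workaround sketch (a PyTorch-level cap, not a DeepSpeed option):
# limit this process to roughly 40 GB of a 46 GB card before deepspeed.initialize(...)
import torch

device = torch.cuda.current_device()
torch.cuda.set_per_process_memory_fraction(40 / 46, device=device)
# remaining pressure is typically addressed in the DeepSpeed config,
# e.g. zero_optimization stage 2/3 with CPU offload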

hjc
- 9
- 3
1
vote
0 answers
No threads to run a task? I tried to use Docker to run distributed training, but failed
I use Docker in the GNS3 VM; the code runs in 2 containers.
I want to try the PyTorch tutorial "Implementing a Parameter Server Using Distributed RPC Framework", but the trainer container gives this error:
Rank 1 training batch 0 loss 2.3123748302459717
Process…
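For reference, a hedged sketch (not the tutorial verbatim) of initializing RPC with an explicit worker-thread pool, which is the setting usually tied to "no threads to run a task"; names, ranks, and the master address are placeholders.

# RPC init with an explicit thread pool (placeholders throughout)
import os
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "10.0.0.2")   # parameter-server container
os.environ.setdefault("MASTER_PORT", "29500")

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
rpc.init_rpc("trainer1", rank=1, world_size=2, rpc_backend_options=options)
# ... issue rpc.rpc_sync / rpc.remote calls against the parameter server ...
rpc.shutdown()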

user21227198
- 11
- 1
1
vote
2 answers
What happens to model weights, and how does checkpointing work?
I have a basic question about model weights and checkpoints.
When training a model, each layer in the model graph calls a kernel executed on the GPU. The weights remain on the GPU for the forward and backward passes. Once the weights are updated…
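For concreteness, a minimal PyTorch checkpointing sketch: the state_dicts are serialized to disk when saved, independent of the tensors having lived on the GPU during training, and mapped onto whatever device is available at restore time. Model, optimizer, and file name are placeholders.

# Minimal checkpoint sketch (placeholder model/optimizer/path)
import torch

model = torch.nn.Linear(10, 1).cuda() if torch.cuda.is_available() else torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# save: tensors are copied off the GPU and written to the file
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 5}, "checkpoint.pt")

# load: map weights onto the device available at restore time
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])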

user3696282
- 25
- 7