Questions tagged [distributed-training]
83 questions
4
votes
0 answers
Issues when using HuggingFace `accelerate` with `fp16`
I'm trying to use the accelerate module to parallelize my model training, but I have trouble using it when training models with fp16. If I load the model with torch_dtype=torch.float16, I get ValueError: Attempting to unscale FP16 gradients. But if I…
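A possible direction, as a minimal sketch only: keep the master weights in fp32 and let Accelerator manage the fp16 autocast and gradient scaling, instead of loading the model with torch_dtype=torch.float16. The model name and optimizer below are placeholders, not the asker's setup.

# Sketch: fp32 weights + accelerate-managed mixed precision (placeholder model)
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator(mixed_precision="fp16")      # grad scaler + autocast handled here
model = AutoModelForCausalLM.from_pretrained("gpt2")   # loaded in fp32 on purpose
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model, optimizer = accelerator.prepare(model, optimizer)
# in the training loop: accelerator.backward(loss); optimizer.step(); optimizer.zero_grad()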

weiqis
- 41
- 1
- 3
3
votes
1 answer
On batch size, epochs, and learning rate of DistributedDataParallel
I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the…
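For illustration only, here is the arithmetic behind one common convention (the linear-scaling rule); the per-GPU batch size and base learning rate are assumptions, not values from the question.

# Illustrative numbers: 100 * 8 = 800 images, 8 DDP processes
dataset_size = 800
world_size = 8
per_gpu_batch = 10          # assumption
base_lr = 0.1               # LR tuned for a single GPU with batch 10 (assumption)

global_batch = per_gpu_batch * world_size               # 80 samples per optimizer step
steps_per_epoch = dataset_size // global_batch          # 10 steps, each process sees its shard
scaled_lr = base_lr * world_size                        # 0.8 under linear scaling (one convention)

print(global_batch, steps_per_epoch, scaled_lr)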

Simon
- 5,070
- 5
- 33
- 59
3
votes
1 answer
tf.data vs tf.keras.preprocessing.image.ImageDataGenerator
I was reading about different techniques to load large datasets efficiently. tf.data seems to perform well compared to tf.keras.preprocessing.image.ImageDataGenerator.
From what I know, tf.data uses CPU pipelining to efficiently load the data…
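For comparison, a minimal sketch of the kind of tf.data pipeline being discussed, assuming JPEG files on disk; the file pattern, image size, and batch size are placeholders.

# Minimal tf.data input pipeline sketch (placeholders throughout)
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def load_image(path):
    # decode and resize one image on the CPU
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (224, 224)) / 255.0

ds = (tf.data.Dataset.list_files("images/*.jpg")
        .map(load_image, num_parallel_calls=AUTOTUNE)   # parallel CPU decoding
        .batch(32)
        .prefetch(AUTOTUNE))                            # overlap input with training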

superduper
- 401
- 1
- 5
- 16
3
votes
1 answer
Distributed training over local GPU and Colab GPU
I want to fine-tune ALBERT.
I see that one can distribute neural-network training over multiple GPUs using TensorFlow: https://www.tensorflow.org/guide/distributed_training
I was wondering if it's possible to distribute fine-tuning across both my laptop's…
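Mechanically, TensorFlow's MultiWorkerMirroredStrategy treats any mutually reachable machines as workers via TF_CONFIG; whether a laptop and a Colab VM can actually reach each other over the network is the real obstacle. The sketch below only shows the mechanism, with placeholder addresses, not a verified laptop-plus-Colab recipe.

# Sketch of multi-worker setup; addresses/ports are placeholders
import json, os
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["laptop.example:12345", "colab.example:12345"]},
    "task": {"type": "worker", "index": 0},   # index 1 on the other machine
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")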

Gog
- 93
- 6
2
votes
0 answers
Distributed training with torchrun on 3 nodes: connection timeout
I have a problem running distributed PyTorch training with torchrun. First of all, this is the script I am trying to run:
import torch
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
import…
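The script is cut off, so here is only a hedged sketch of the usual setup: init_process_group reads the environment that torchrun provides, with an explicitly widened timeout; the launch command and rendezvous endpoint in the comment are placeholders.

# Launch on every node (placeholder endpoint and node_rank):
#   torchrun --nnodes=3 --nproc_per_node=1 --node_rank=<0|1|2> \
#            --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29500 train.py
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                            # "gloo" if NCCL/GPUs are unavailable
    timeout=datetime.timedelta(minutes=30),    # widen the rendezvous/collective timeout
)
rank = dist.get_rank()
world_size = dist.get_world_size()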

Morteza
- 46
- 2
2
votes
0 answers
How to build custom model using tf.keras on TensorFlow 2.x that supports SageMaker distributed training?
How can I create custom models built with tf.keras on TensorFlow 2.x that support distributed training (multiple GPU instances) in Amazon SageMaker, e.g. using the Distributed Data Parallel Library (DDPL)?
The documentation mentions that tf.keras is not…

juvchan
- 6,113
- 2
- 22
- 35
2
votes
0 answers
`steps_per_epoch` in google ai platform multi-worker distributed training
I'm training a model with tensorflow==2.7.0 distributed on Google Cloud AI Platform.
I'm using the ParameterServerStrategy with multiple workers.
One thing I'm confused about, and couldn't find an answer to, is how to properly set the number of steps each worker…
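As an illustration of one common convention (not the only one): steps_per_epoch is derived from the global batch size across all workers, not the per-worker batch. The numbers below are placeholders.

# Illustrative arithmetic only
dataset_size = 1_000_000
per_worker_batch = 64
num_workers = 4

global_batch = per_worker_batch * num_workers       # 256 samples per step across the cluster
steps_per_epoch = dataset_size // global_batch      # 3906

print(steps_per_epoch)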

govordovsky
- 359
- 2
- 17
2
votes
0 answers
Getting ProcessExitedException. How to spawn multiple processes on a Databricks notebook using torch.multiprocessing?
I am trying out distributed training in PyTorch using the "DistributedDataParallel" strategy on Databricks notebooks (or any notebook environment), but I am stuck with multi-processing in the Databricks notebook environment.
Problem: I want to spawn…
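For reference, a minimal sketch of the torch.multiprocessing.spawn pattern that usually raises ProcessExitedException when a worker dies; whether it runs inside a Databricks notebook depends on the environment (the worker often has to be importable from a module rather than defined inline), so treat this as illustration only.

# Minimal spawn sketch (illustration only)
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)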

sarjit07
- 7,511
- 1
- 17
- 15
2
votes
0 answers
MirroredVariable has different values on replicas (zeros, except on one device)
Minimal example to demonstrate the problem:
import tensorflow as tf
with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))
Output on a 4-GPU server:
INFO:tensorflow:Using MirroredStrategy with devices…

isarandi
- 3,120
- 25
- 35
2
votes
1 answer
Is there any way to get global ranks from Pytorch distributed (nccl) group?
Suppose we have a PyTorch distributed group object that was initialized by torch.distributed.new_group([a,b,c,d]); is there any way to get the global ranks a,b,c,d back from this group?
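A sketch assuming a recent PyTorch release where a helper for exactly this exists; on older versions the mapping has to be recorded manually when the group is created.

# Assumes torch.distributed.get_process_group_ranks is available (recent PyTorch)
import torch.distributed as dist

group = dist.new_group([0, 2, 5, 7])              # a, b, c, d as global ranks
global_ranks = dist.get_process_group_ranks(group)
print(global_ranks)                               # -> [0, 2, 5, 7]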

Qin Heyang
- 1,456
- 1
- 16
- 18
2
votes
1 answer
Is there a way to train an ML model on multiple laptops?
I have two laptops and want to use both for DL model training. I don't have any experience with distributed systems and want to know whether it is possible to use the processing power of two laptops together to train a single model. What about…
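In principle yes: PyTorch DDP over the gloo backend can span two machines on the same network. A minimal sketch, with a placeholder IP address and one process per laptop; not a tuned or verified recipe.

# Two laptops on the same LAN, one process each, gloo backend (sketch only)
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://192.168.1.10:29500",   # placeholder: first laptop's IP
        rank=rank,
        world_size=world_size,
    )

setup(rank=0, world_size=2)                        # run with rank=1 on the other laptop
model = DDP(torch.nn.Linear(10, 1))                # placeholder model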

superduper
- 401
- 1
- 5
- 16
2
votes
1 answer
A simple distributed training Python program for deep learning models with Horovod on a GPU cluster
I am trying to run some example Python 3 code
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on databricks GPU cluster (with 1 driver and 2 workers).
Databricks environment:
ML 6.6, scala 2.11, Spark…
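For context, a pared-down sketch of the HorovodRunner launch pattern used in the linked Databricks docs, assuming that API; the training function body is omitted and np=2 matches the two workers.

# HorovodRunner launch pattern (sketch; body of train_hvd omitted)
import horovod.tensorflow.keras as hvd
from sparkdl import HorovodRunner

def train_hvd():
    hvd.init()
    # ... build a tf.keras model, wrap its optimizer with
    # hvd.DistributedOptimizer(...), then call model.fit(...) ...

hr = HorovodRunner(np=2)   # distribute onto 2 worker GPUs
hr.run(train_hvd)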

user3448011
- 1,469
- 1
- 17
- 39
1
vote
0 answers
How to set max GPU memory use for each device when using DeepSpeed for distributed training?
I am new to DeepSpeed and have some experience in deep learning. I want to know how to set the maximum GPU memory to use for each device when using DeepSpeed.
I haven't tried anything yet and have no ideas so far.
My GPU has about 46G of memory, and I want to run long…
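As far as I know DeepSpeed itself does not expose a single "max GPU memory" knob; one workaround is PyTorch's per-process memory cap, with the rest handled by ZeRO offload settings in the DeepSpeed JSON config. The fraction and device index below are assumptions.

# Workaround sketch (a PyTorch-level cap, not a DeepSpeed option):
# limit this process to roughly 40 GB of a 46 GB card before deepspeed.initialize(...)
import torch

device = torch.cuda.current_device()
torch.cuda.set_per_process_memory_fraction(40 / 46, device=device)
# remaining pressure is typically addressed in the DeepSpeed config,
# e.g. zero_optimization stage 2/3 with CPU offload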

hjc
- 9
- 3
1
vote
0 answers
No threads to run a task? I tried to use Docker to run distributed training, but failed
I use Docker in the GNS3 VM; the code runs in 2 containers.
I want to try the PyTorch tutorial "Implementing a Parameter Server Using Distributed RPC Framework", but the trainer container gives this error:
Rank 1 training batch 0 loss 2.3123748302459717
Process…
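For reference, a hedged sketch (not the tutorial verbatim) of initializing RPC with an explicit worker-thread pool, which is the setting usually tied to "no threads to run a task"; names, ranks, and the master address are placeholders.

# RPC init with an explicit thread pool (placeholders throughout)
import os
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "10.0.0.2")   # parameter-server container
os.environ.setdefault("MASTER_PORT", "29500")

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
rpc.init_rpc("trainer1", rank=1, world_size=2, rpc_backend_options=options)
# ... issue rpc.rpc_sync / rpc.remote calls against the parameter server ...
rpc.shutdown()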

user21227198
- 11
- 1
1
vote
2 answers
What happens to model weights, and how does checkpointing work?
I have a basic question about model weights and checkpoints.
When training a model, each layer in the model graph calls a kernel executed on the GPU. The weights remain on the GPU for the forward and backward passes. Once the weights are updated…
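For concreteness, a minimal PyTorch checkpointing sketch: the state_dicts are serialized to disk when saved, independent of the tensors having lived on the GPU during training, and mapped onto whatever device is available at restore time. Model, optimizer, and file name are placeholders.

# Minimal checkpoint sketch (placeholder model/optimizer/path)
import torch

model = torch.nn.Linear(10, 1).cuda() if torch.cuda.is_available() else torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# save: tensors are copied off the GPU and written to the file
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 5}, "checkpoint.pt")

# load: map weights onto the device available at restore time
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])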

user3696282
- 25
- 7