This is the first time I am using Ray Tune to look for the best hyperparameters for a DL model, and I am running into some problems related to memory usage.

The memory usage on the node keeps increasing, which eventually leads to the trial runs failing with an error. Below is what I get while the script is running.

== Status ==
Current time: 2022-06-16 13:27:43 (running for 00:09:14.60)
Memory usage on this node: 26.0/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/51.37 GiB heap, 0.0/0.93 GiB objects (0.0/1.0 accelerator_type:A40)
Result logdir: /app/ray_experiment_0
Number of trials: 3/20 (1 PENDING, 2 RUNNING)
+--------------+----------+----------------+-----------------+--------------+
| Trial name   | status   | loc            |   learning_rate |   batch_size |
|--------------+----------+----------------+-----------------+--------------|
| run_cf921dd8 | RUNNING  | 172.17.0.3:402 |       0.0374603 |           64 |
| run_d20c6f50 | RUNNING  | 172.17.0.3:437 |       0.0950719 |           64 |
| run_d20e37cc | PENDING  |                |       0.0732021 |           64 |
+--------------+----------+----------------+-----------------+--------------+

I am not sure I completely understand what Ray is accumulating here and how to avoid this accumulation. I have found a few similar issues (this one and this one, for instance), but so far, setting

ray.init(object_store_memory=10**9)

did not help.
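
For reference, a minimal sketch of how such a limit is applied, i.e. capping the object store before any Tune work starts (the grid_search call here just stands in for the wrapper shown further down):

import ray

# Cap the Ray object store at ~1 GB; this only takes effect if it runs
# before any Ray tasks/actors are created, i.e. before tune.run.
ray.init(object_store_memory=10**9)

# ... then launch the experiment, e.g.:
# grid_search(config)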

The code I am using (copied below) is pretty much taken from the documentation. I am basically using Bayesian optimization to sample the hyperparameters in a smart way and an ASHA scheduler to stop trials early if they are not promising enough:

from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch  # ray.tune.search.bayesopt on Ray >= 2.0


def grid_search(config):

    # For stopping non promising trials early
    scheduler = ASHAScheduler(
        max_t=5,
        grace_period=1,
        reduction_factor=2)

    # Bayesian optimisation to sample hyperparameters in a smarter way
    algo = BayesOptSearch(random_search_steps=4, mode="min")

    reporter = CLIReporter(
        parameter_columns=["learning_rate", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])

    resources_per_trial = {"cpu": config["n_cpu_per_trials"], "gpu": config["n_gpu_per_trials"]}

    # `run` is the actual training function, defined elsewhere in the script
    trainable = tune.with_parameters(run)

    analysis = tune.run(trainable,
        resources_per_trial=resources_per_trial,
        metric="loss",
        mode="min",
        config=config,
        num_samples=config["n_sampling"], # Number of times to sample from the hyperparameter space
        scheduler=scheduler,
        progress_reporter=reporter,
        name=config["name_experiment"],
        local_dir="/app/.",
        search_alg=algo)

    print("Best hyperparameters found were: ", analysis.best_config)

I would really appreciate hearing from anyone who has managed to solve this issue.
