I'm encountering a subtle memory leak and I'm unable to determine its source using tracemalloc. I run the following code in Google Colab; it's meant to optimize hyperparameters for a custom PPO agent. The speed at which the leak happens also varies: sometimes it shows up within 10-20 minutes of runtime (5-10 iterations), while other times it takes up to 50 iterations / several hours. Here's a Colab notebook with the full code.
import os
import tracemalloc  # needed for the per-trial snapshots taken in optimize_agent below
from time import perf_counter
import numpy as np
import optuna
import pandas as pd
import xagents
from tensorflow.keras.optimizers import Adam
from xagents import PPO
from xagents.utils.common import ModelReader, create_envs, write_from_dict
def get_hparams(trial):
    return {
        'n_steps': int(
            trial.suggest_categorical('n_steps', [2 ** i for i in range(2, 11)])
        ),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1e-2),
        'epsilon': trial.suggest_loguniform('epsilon', 1e-7, 1e-1),
        'beta_1': trial.suggest_loguniform('beta_1', 0.01, 0.999),
        'beta_2': trial.suggest_loguniform('beta_2', 0.01, 0.999),
        'entropy_coef': trial.suggest_loguniform('entropy_coef', 1e-8, 2e-1),
        'n_envs': int(
            trial.suggest_categorical('n_envs', [2 ** i for i in range(4, 7)])
        ),
        'grad_norm': trial.suggest_uniform('grad_norm', 0.1, 10.0),
        'lam': trial.suggest_loguniform('lam', 0.65, 0.99),
        'advantage_epsilon': trial.suggest_loguniform('advantage_epsilon', 1e-8, 1e-1),
        'clip_norm': trial.suggest_loguniform('clip_norm', 0.01, 10),
    }
def optimize_agent(trial, seed=55):
    hparams = get_hparams(trial)
    envs = create_envs('BreakoutNoFrameskip-v4', hparams['n_envs'])
    model_cfg = xagents.agents['ppo']['model']['cnn'][0]
    optimizer = Adam(
        hparams['learning_rate'],
        epsilon=hparams['epsilon'],
        beta_1=hparams['beta_1'],
        beta_2=hparams['beta_2'],
    )
    model = ModelReader(
        model_cfg,
        [envs[0].action_space.n, 1],
        envs[0].observation_space.shape,
        optimizer,
        seed=seed,
    ).build_model()
    agent = PPO(
        envs,
        model,
        entropy_coef=hparams['entropy_coef'],
        grad_norm=hparams['grad_norm'],
        gamma=hparams['gamma'],
        n_steps=hparams['n_steps'],
        seed=seed,
        lam=hparams['lam'],
        advantage_epsilon=hparams['advantage_epsilon'],
        clip_norm=hparams['clip_norm'],
    )
    steps = 150000
    agent.fit(max_steps=steps)
    # Use the mean reward over the run as the trial objective.
    current_rewards = np.around(np.mean(agent.total_rewards), 2)
    if not np.isfinite(current_rewards):
        current_rewards = 0
    hparams[f'mean_after_{steps}_steps'] = current_rewards
    write_from_dict(
        {key: [val] for (key, val) in hparams.items()}, 'ppo-optuna.parquet'
    )
    trial.report(current_rewards, 1000)
    # Dump a tracemalloc snapshot after every trial for later inspection.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    frame = pd.DataFrame(
        [
            {'traceback': item.traceback, 'size': item.size, 'count': item.count}
            for item in top_stats
        ]
    )
    frame.to_csv(f'memory-snapshots/snapshot-{perf_counter()}.csv', index=False)
    return current_rewards
And here's what I run:
tracemalloc.start()  # tracing must be active before the first take_snapshot() call in optimize_agent
os.mkdir('memory-snapshots')
study = optuna.create_study(
    study_name='ppo-trials150',
    load_if_exists=True,
    storage="sqlite:///example-ppo.db",
    direction='maximize',
)
study.optimize(
    optimize_agent, n_trials=1000, show_progress_bar=True, gc_after_trial=True
)
This causes memory usage to balloon very quickly until the session crashes.
And if I stop the cell before it crashes, the inflated memory persists until the runtime is restarted.
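For reference, here's the kind of check I run in a separate cell to watch the overall process memory between trials (a minimal sketch using psutil, which I believe comes preinstalled on Colab):

import os
import psutil

def log_rss(tag=''):
    # Resident set size of the current kernel process, in MiB.
    rss = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f'{tag} RSS: {rss:.1f} MiB')

log_rss('after interrupting the study')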
And here are the resulting memory snapshots taken at 15 consecutive iterations leading up to the crash, along with the top 23 tracebacks; they don't show any particular red flags. What's more, summing the size column of the most recent snapshot totals 967,249,710 bytes ~= 1 GB, which is weird because the available memory on Colab is roughly 12 GB, i.e. the snapshots only account for a small fraction of what the process ends up using.
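For reference, the ~1 GB figure above comes from reading back the snapshot CSVs that optimize_agent writes, roughly like this:

import glob
import os
import pandas as pd

# Total the tracked allocation sizes in the most recently written snapshot.
latest = max(glob.glob('memory-snapshots/snapshot-*.csv'), key=os.path.getmtime)
total_bytes = pd.read_csv(latest)['size'].sum()
print(latest, total_bytes, f'~= {total_bytes / 1024 ** 3:.2f} GiB')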
And here's one of the crash logs:
Timestamp Level Message
Jul 13, 2021, 9:58:11 AM WARNING WARNING:root:kernel 834cf1f9-a397-4369-b54a-0cf6da2e980f restarted
Jul 13, 2021, 9:58:11 AM INFO KernelRestarter: restarting kernel (1/5), keep random ports
Jul 13, 2021, 9:22:48 AM WARNING tcmalloc: large alloc 1849688064 bytes == 0x55831c4f8000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:21:30 AM WARNING tcmalloc: large alloc 1849688064 bytes == 0x5583a976a000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e60dd97 0x7fdc4e6074a5 0x7fdc4e6d829c 0x7fdc4e6a5dd1 0x558257797338 0x5582578cb1ba 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441 0x7fdbf0428133 0x7fdbeb6aad75 0x7fdc56efc6db
Jul 13, 2021, 9:21:27 AM WARNING tcmalloc: large alloc 1849688064 bytes == 0x55831c4f8000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:20:08 AM WARNING 2021-07-13 07:20:08.307941: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM WARNING 2021-07-13 07:20:08.307851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM WARNING 2021-07-13 07:20:08.221051: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:08 AM WARNING 2021-07-13 07:20:08.220943: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM WARNING 2021-07-13 07:20:06.646798: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM WARNING 2021-07-13 07:20:06.646709: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM WARNING 2021-07-13 07:20:06.122847: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:06 AM WARNING 2021-07-13 07:20:06.122724: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Jul 13, 2021, 9:20:03 AM WARNING tcmalloc: large alloc 1849688064 bytes == 0x5583a976a000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e60dd97 0x7fdc4e6074a5 0x7fdc4e6d829c 0x7fdc4e6a5dd1 0x558257797338 0x5582578cb1ba 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441 0x7fdbf0428133 0x7fdbeb6aad75 0x7fdc56efc6db
Jul 13, 2021, 9:20:01 AM WARNING tcmalloc: large alloc 1849688064 bytes == 0x55833a870000 @ 0x7fdc571471e7 0x7fdc4e5bd46e 0x7fdc4e60dc7b 0x7fdc4e610e83 0x7fdc4e61107b 0x7fdc4e6b2761 0x558257855d54 0x558257855a50 0x5582578ca105 0x5582578c44ae 0x5582578573ea 0x5582578c97f0 0x5582578c44ae 0x5582578573ea 0x5582578c53b5 0x5582578c47ad 0x558257857a81 0x558257857ea1 0x5582578c6bb5 0x5582578c47ad 0x558257796eb1 0x5582578c6bb5 0x5582578c47ad 0x558257857a81 0x55825789afd9 0x558257857ea1 0x7fdbefb83954 0x7fdbefb873ba 0x7fdbeaf6eeb4 0x7fdbeaf631fe 0x7fdbf042b441
Jul 13, 2021, 9:18:49 AM WARNING 2021-07-13 07:18:49.736991: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Jul 13, 2021, 9:18:47 AM WARNING 2021-07-13 07:18:47.153482: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
Jul 13, 2021, 9:18:23 AM WARNING 2021-07-13 07:18:23.372605: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8004
Jul 13, 2021, 9:18:21 AM WARNING 2021-07-13 07:18:21.045353: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
Jul 13, 2021, 9:18:18 AM WARNING 2021-07-13 07:18:18.540742: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2199995000 Hz
Jul 13, 2021, 9:18:18 AM WARNING 2021-07-13 07:18:18.470990: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.026567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13837 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.026490: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.025754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.024324: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.022506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.021712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.021041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
Jul 13, 2021, 9:17:53 AM WARNING 2021-07-13 07:17:53.019732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.878647: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.875397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.874569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.873445: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
Jul 13, 2021, 9:17:47 AM WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.873293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.872460: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.871015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.867034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.865768: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.865559: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.860081: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.843825: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.552444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.509098: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.374421: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.374270: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.232497: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:17:47 AM WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
Jul 13, 2021, 9:17:47 AM WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.232421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.231418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jul 13, 2021, 9:17:47 AM WARNING 2021-07-13 07:17:47.168074: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
Jul 13, 2021, 9:17:31 AM WARNING 2021-07-13 07:17:31.798739: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Jul 13, 2021, 9:13:58 AM INFO Uploading file to /content/xagents.tar.gz
Jul 13, 2021, 9:13:33 AM INFO Adapting to protocol v5.1 for kernel 834cf1f9-a397-4369-b54a-0cf6da2e980f
Jul 13, 2021, 9:13:32 AM INFO Kernel started: 834cf1f9-a397-4369-b54a-0cf6da2e980f
Jul 13, 2021, 9:13:24 AM INFO Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jul 13, 2021, 9:13:24 AM INFO http://172.28.0.12:9000/
Jul 13, 2021, 9:13:24 AM INFO The Jupyter Notebook is running at:
Jul 13, 2021, 9:13:24 AM INFO 0 active kernels
Jul 13, 2021, 9:13:24 AM INFO Serving notebooks from local directory: /
Jul 13, 2021, 9:13:24 AM INFO google.colab serverextension initialized.
Jul 13, 2021, 9:13:24 AM INFO Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jul 13, 2021, 9:13:24 AM INFO http://172.28.0.2:9000/
Jul 13, 2021, 9:13:24 AM INFO The Jupyter Notebook is running at:
Jul 13, 2021, 9:13:24 AM INFO 0 active kernels
Jul 13, 2021, 9:13:24 AM INFO Serving notebooks from local directory: /
Jul 13, 2021, 9:13:24 AM INFO google.colab serverextension initialized.
Jul 13, 2021, 9:13:24 AM INFO Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Jul 13, 2021, 9:13:24 AM INFO Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret