I am generating BERT embeddings on a GPU and using them to train a CatBoost model. The embedding generation runs without any issue on the GPU. The problem occurs when I try to convert these tensors to numpy: almost all of the RAM is consumed (approx. 20 GB), although the total training data size is only 2 GB.
The environment I am running this in is:
- Google Kubernetes Engine: 1.21.11-gke.900
- CUDA 11.0
- PyTorch 1.11.0
- nvidia/cuda:11.0.3-base-ubuntu20.04 Docker image
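For context, the embedding columns are built roughly along these lines (a simplified sketch, not the actual pipeline; the model name, batch size, and source text column below are placeholders):

import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

def embed(texts, batch_size=256):
    # Return one CLS-token embedding per input text, kept as a GPU tensor
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], padding=True,
                            truncation=True, return_tensors="pt").to(device)
            chunks.append(model(**enc).last_hidden_state[:, 0, :])
    return torch.cat(chunks, dim=0)

# "title_cat" stands in for whatever text column feeds this embedding;
# each cell of the new column ends up holding a torch.Tensor object
df_final["title_cat_embed2_1"] = list(embed(df_final["title_cat"].tolist()))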
Here is a sample piece of code along with memory profiling:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
568 1834.4 MiB 1834.4 MiB 1 @staticmethod
569 @profile(stream=fp)
570 def prepare_train_test_df(df_final):
571 1834.4 MiB 0.0 MiB 1 LOG.info("Prepare train test sample.....")
572 5532.2 MiB 0.4 MiB 2 df_final["title_cat_ar"] = df_final["title_cat_embed2_1"].apply(
573 5531.8 MiB 3697.4 MiB 1876353 lambda row: row.detach().cpu().numpy()
574 )
575 5532.2 MiB 0.0 MiB 1 LOG.info("Converting column {} to numpy.....".format("title_cat_embed2_1"))
576
577 8820.9 MiB 0.0 MiB 2 df_final["desc_embed_ar_1"] = df_final["desc_embed2_1"].apply(
578 8820.8 MiB 3288.7 MiB 1876353 lambda row: row.detach().cpu().numpy()
579 )
580 8820.9 MiB 0.0 MiB 1 LOG.info("Converting column {} to numpy.....".format("desc_embed2_1"))
581
582 12109.9 MiB 1.0 MiB 2 df_final["title_cat_embed_ar_2"] = df_final["title_cat_embed2_2"].apply(
583 12108.9 MiB 3288.0 MiB 1876353 lambda row: row.detach().cpu().numpy()
584 )
585 12109.9 MiB 0.0 MiB 1 LOG.info("Converting column {} to numpy.....".format("title_cat_embed2_2"))
586
587 15397.7 MiB 0.8 MiB 2 df_final["desc_embed_ar_2"] = df_final["desc_embed2_2"].apply(
588 15396.9 MiB 3287.0 MiB 1876353 lambda row: row.detach().cpu().numpy()
589 )
As seen from the profiler report, almost 3 GB of memory is occupied after the row.detach().cpu().numpy()
operation on each pandas DataFrame column. The size of each column in the DataFrame is itself not that high (the max is 112 MB), so it seems like something else is consuming the memory.
## memory consumption by column in bytes
title_embed 60043264
title_embed2 60043264
title_cat_ar 112581120
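For reference, a quick way to cross-check these numbers is to compare the raw payload of the converted arrays against what pandas reports for the column (a small sketch using the columns already in df_final):

import sys

col = df_final["title_cat_ar"]                  # column of per-row numpy arrays
payload = sum(a.nbytes for a in col)            # logical size of the array data
per_obj = sum(sys.getsizeof(a) for a in col)    # per-array object size (header only if the array is a view)
reported = col.memory_usage(deep=True)          # what pandas reports for this column
print(payload, per_obj, reported)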
Here I assign a single tensor and then measure memory consumption; for some reason a single tensor conversion occupies 799 MB of space:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
568 1835.1 MiB 1835.1 MiB 1 @staticmethod
569 @profile(stream=fp)
570 def prepare_train_test_df(df_final):
571 1835.1 MiB 0.0 MiB 1 LOG.info("Prepare train test sample.....")
572 1835.1 MiB 0.0 MiB 1 first_row = df_final["title_cat_embed2_1"].iloc[0]
573 2634.2 MiB 799.1 MiB 1 first_row_numpy = first_row.detach().cpu().numpy()
574 2634.2 MiB 0.0 MiB 1 frow = df_final["title_cat_embed2_1"].iloc[1].tolist()
575 5532.7 MiB 0.2 MiB 2 df_final["title_cat_ar"] = df_final["title_cat_embed2_1"].apply(
576 5532.5 MiB 2898.3 MiB 1876353 lambda row: row.detach().cpu().numpy()
577 )
578 5526.2 MiB -6.5 MiB 1 del df_final["title_cat_embed2_1"]
579 5526.2 MiB 0.0 MiB 1 LOG.info("mempry consumed is {}".format(df_final.memory_usage(deep=True)))
580 5526.2 MiB 0.0 MiB 1 LOG.info("Converting column {} to numpy.....".format("title_cat_embed2_1"))
581
582 5526.2 MiB 0.0 MiB 1 desc_row = df_final["desc_embed2_1"].iloc[0]
583 5526.2 MiB 0.0 MiB 1 desc_row = desc_row.detach().cpu().numpy()
584 5526.2 MiB 0.0 MiB 1 drow = df_final["desc_embed2_1"].iloc[1].tolist()
585 8814.8 MiB 0.2 MiB 2 df_final["desc_embed_ar_1"] = df_final["desc_embed2_1"].apply(
586 8814.6 MiB 3288.4 MiB 1876353 lambda row: row.detach().cpu().numpy()
587 )
588 8807.7 MiB -7.2 MiB 1 del df_final["desc_embed2_1"]
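One quick check (a sketch using PyTorch 1.11's storage() API, not something from the code above) is whether each row tensor is a small view into a much larger shared storage, which might be relevant to the ~799 MB jump for a single row:

t = df_final["title_cat_embed2_1"].iloc[0]      # one embedding tensor, before any conversion
print(t.shape, t.numel())                       # elements visible through this row
print(t.storage().size(), t.element_size())     # elements in the backing storage, bytes per element
print(t.storage().size() * t.element_size() / 1e6, "MB backing this single row")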