I am training a SageMaker XGBoost model and want to get the importance of each feature, so I am using smdebug (SageMaker Debugger) to collect the feature importance during training. However, the training job fails with "ValueError: Message smdebug.Event exceeds maximum protobuf size of 2GB: 2663116783" and I cannot proceed. How can I resolve this? My estimator configuration, the training log with the full traceback, and a rough size calculation are below.
import sagemaker
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

# role, container, s3_model_output_location, training_input_config and
# validation_input_config are defined earlier in the notebook.
hyperparameters = {
    "max_depth": "5",
    "eta": "0.1",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "silent": "0",
    "objective": "binary:logistic",
    "num_round": "150",
}

save_interval = 5

xgboost_estimator = sagemaker.estimator.Estimator(
    role=role,
    base_job_name="xgboost",
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    image_uri=container,
    hyperparameters=hyperparameters,
    max_run=1800,
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=s3_model_output_location,  # Required
        collection_configs=[
            CollectionConfig(name="metrics", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="feature_importance", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="full_shap", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)}),
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)

data_channels = {"train": training_input_config, "validation": validation_input_config}
model = xgboost_estimator.fit(data_channels)
2023-04-07 17:03:53 Starting - Starting the training job...
2023-04-07 17:04:17 Starting - Preparing the instances for trainingLossNotDecreasing: InProgress
......
2023-04-07 17:05:17 Downloading - Downloading input data......
2023-04-07 17:06:17 Training - Training image download completed. Training in progress..INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[17:06:27] 1794553x370 matrix with 663984610 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Determined delimiter of CSV input is ','
[17:06:32] 769095x370 matrix with 284565150 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
INFO:root:Single node training.
[2023-04-07 17:06:32.846 ip-10-0-204-248.ec2.internal:7 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2023-04-07 17:06:32.846 ip-10-0-204-248.ec2.internal:7 INFO hook.py:151] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2023-04-07 17:06:32.846 ip-10-0-204-248.ec2.internal:7 INFO hook.py:196] Saving to /opt/ml/output/tensors
INFO:root:Debug hook created from config
INFO:root:Train matrix has 1794553 rows
INFO:root:Validation matrix has 769095 rows
[0]#011train-error:0.048332#011validation-error:0.048469
[2023-04-07 17:06:50.422 ip-10-0-204-248.ec2.internal:7 INFO hook.py:325] Monitoring the collections: full_shap, metrics, feature_importance, losses, average_shap
Exception in thread Thread-1:
Traceback (most recent call last):
File "/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/miniconda3/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 162, in run
positions = self._ev_writer.write_event(event)
File "/miniconda3/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 37, in write_event
return self._write_serialized_event(event.SerializeToString())
ValueError: Message smdebug.Event exceeds maximum protobuf size of 2GB: 2663116783
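For context, here is a back-of-the-envelope calculation I did. My assumption (not confirmed anywhere) is that the full_shap collection saves per-row SHAP contributions, i.e. a matrix of shape (n_rows, n_features + 1), as float32. The numbers line up almost exactly with the size in the error, so I suspect that collection is what blows past the 2 GB protobuf limit:

# Rough size estimate; assumes full_shap stores one float32 SHAP value per row
# per feature plus a bias column -- an assumption on my part.
n_rows, n_features = 1_794_553, 370        # from the training log above
shap_values = n_rows * (n_features + 1)    # 665,779,163 values per save step
approx_bytes = shap_values * 4             # float32 -> 2,663,116,652 bytes
print(f"{approx_bytes:,}")                 # ~2.66 GB vs. 2,663,116,783 in the error

Assuming that reasoning is right, is the fix simply to drop the full_shap collection (or save it less often), or is there a supported way to collect feature importance without hitting the 2 GB limit?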