I am training a SageMaker XGBoost model and want to get the importance of each feature, so I am using smdebug (SageMaker Debugger) to collect the feature importance during training. However, the training job fails with "ValueError: Message smdebug.Event exceeds maximum protobuf size of 2GB: 2663116783" and I cannot proceed. How can I resolve this? My estimator configuration, the training log with the full traceback, and a rough size calculation are below.
import sagemaker
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

# role, container, s3_model_output_location, training_input_config and
# validation_input_config are defined earlier in the notebook.
hyperparameters = {
    "max_depth": "5",
    "eta": "0.1",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "silent": "0",
    "objective": "binary:logistic",
    "num_round": "150",
}

save_interval = 5

xgboost_estimator = sagemaker.estimator.Estimator(
    role=role,
    base_job_name="xgboost",
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    image_uri=container,
    hyperparameters=hyperparameters,
    max_run=1800,
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=s3_model_output_location,  # Required
        collection_configs=[
            CollectionConfig(name="metrics", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="feature_importance", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="full_shap", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)}),
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)

data_channels = {"train": training_input_config, "validation": validation_input_config}
model = xgboost_estimator.fit(data_channels)
2023-04-07 17:03:53 Starting - Starting the training job...
2023-04-07 17:04:17 Starting - Preparing the instances for trainingLossNotDecreasing: InProgress
......
2023-04-07 17:05:17 Downloading - Downloading input data......
2023-04-07 17:06:17 Training - Training image download completed. Training in progress..INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[17:06:27] 1794553x370 matrix with 663984610 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Determined delimiter of CSV input is ','
[17:06:32] 769095x370 matrix with 284565150 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
INFO:root:Single node training.
[2023-04-07 17:06:32.846 ip-10-0-204-248.ec2.internal:7 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2023-04-07 17:06:32.846 ip-10-0-204-248.ec2.internal:7 INFO hook.py:151] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2023-04-07 17:06:32.846 ip-10-0-204-248.ec2.internal:7 INFO hook.py:196] Saving to /opt/ml/output/tensors
INFO:root:Debug hook created from config
INFO:root:Train matrix has 1794553 rows
INFO:root:Validation matrix has 769095 rows
[0]#011train-error:0.048332#011validation-error:0.048469
[2023-04-07 17:06:50.422 ip-10-0-204-248.ec2.internal:7 INFO hook.py:325] Monitoring the collections: full_shap, metrics, feature_importance, losses, average_shap
Exception in thread Thread-1:
Traceback (most recent call last):
File "/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/miniconda3/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 162, in run
positions = self._ev_writer.write_event(event)
File "/miniconda3/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 37, in write_event
return self._write_serialized_event(event.SerializeToString())
ValueError: Message smdebug.Event exceeds maximum protobuf size of 2GB: 2663116783
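For context, here is a back-of-the-envelope calculation I did. My assumption (not confirmed anywhere) is that the full_shap collection saves per-row SHAP contributions, i.e. a matrix of shape (n_rows, n_features + 1), as float32. The numbers line up almost exactly with the size in the error, so I suspect that collection is what blows past the 2 GB protobuf limit:

# Rough size estimate; assumes full_shap stores one float32 SHAP value per row
# per feature plus a bias column -- an assumption on my part.
n_rows, n_features = 1_794_553, 370        # from the training log above
shap_values = n_rows * (n_features + 1)    # 665,779,163 values per save step
approx_bytes = shap_values * 4             # float32 -> 2,663,116,652 bytes
print(f"{approx_bytes:,}")                 # ~2.66 GB vs. 2,663,116,783 in the error

Assuming that reasoning is right, is the fix simply to drop the full_shap collection (or save it less often), or is there a supported way to collect feature importance without hitting the 2 GB limit?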