
We are working on deploying 170 ML models using Azure ML Studio and Azure Kubernetes Service (AKS), following this doc: "https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/machine-learning/how-to-deploy-azure-kubernetes-service.md".

We train each model with a Python script in a custom environment and register it with the Azure ML service. Once a model is registered, we deploy it to AKS using the container image built from that environment.

While deploying the ML models, we can only fit about 10 to 11 models (one per pod) on each node in AKS. When we try to deploy another model to the same node, the deployment times out with the error message below.

(screenshot: deployment timeout error message)

We deploy the models to Azure Kubernetes Service from Python with the sample code below.

    # Imports from the Azure ML SDK (v1).
    # Deployment_name, pip_packages, ws and model_1 are defined earlier in our script.
    from azureml.core import Environment, Model
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.model import InferenceConfig
    from azureml.core.webservice import AksWebservice
    from azureml.core.compute import ComputeTarget

    # Create an environment, add conda/pip dependencies to it,
    # and use it to build the custom container image
    myenv = Environment(name=Deployment_name)
    myenv.python.conda_dependencies = CondaDependencies.create(pip_packages=pip_packages)

    # Inference configuration
    inf_config = InferenceConfig(environment=myenv, entry_script='./Script_file.py')

    # Deployment configuration
    deployment_config = AksWebservice.deploy_configuration(
        cpu_cores=1, memory_gb=1, cpu_cores_limit=2, memory_gb_limit=2,
        traffic_percentile=10)

    # AKS cluster compute target
    aks_target = ComputeTarget(ws, 'pipeline')

    # Deploy the model to AKS
    service = Model.deploy(ws, Deployment_name, model_1, inf_config,
                           deployment_config, aks_target, overwrite=True)
    service.wait_for_deployment(show_output=True)

We also checked the Azure documentation but could not find any configuration or deployment setting for the AKS nodes that addresses this.

Can you please provide more clarification on the statement "The number of models to be deployed is limited to 1,000 models per deployment (per container)", and can you give any insight/feedback on how to increase the number of ML models that can be deployed on each node in Azure Kubernetes Service? Thanks!

Hari Balaji
  • What have you gotten to work so far? Can you deploy 2 models to the same container? – Anders Swanson Sep 06 '21 at 23:46
  • Yes, we can deploy 2 ML models in the same container, and we are using AKS clusters for the deployment. We are trying to create 171 deployments on the AKS cluster, and each deployment contains 2 ML models. We can only fit 10 to 11 deployments on a single cluster node; when we deploy more than that on a node, we get the deployment timeout error. Currently, we have 16 nodes for 160 deployments on the AKS cluster, and we are trying to reduce the node count by increasing the number of deployments per node. – Hari Balaji Sep 07 '21 at 05:30
  • Also, we would like to know how many deployments can be run on a single node in the cluster. – Hari Balaji Sep 07 '21 at 05:30

1 Answer


Based on the error, it looks like there is an issue with your PVC (PersistentVolumeClaim).

The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner based on the requested storage class, or pre-provisioned by an admin.
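
For the pre-provisioned route, an admin would create a PV up front that the claim can bind to. A minimal sketch for AKS using an existing managed disk (the PV name, disk name, disk URI and class name here are placeholders, not values from your cluster):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-data-pv          # placeholder name
spec:
  capacity:
    storage: 10Gi                      # must cover the claim's request
  accessModes:
    - ReadWriteOnce
  storageClassName: "standard"         # must match the claim's storageClassName
  azureDisk:
    kind: Managed
    diskName: myAKSDisk                # placeholder: an existing Azure managed disk
    diskURI: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/disks/myAKSDisk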

There should be a StorageClass that can dynamically provision the PV, with its storageClassName referenced in the volumeClaimTemplates; otherwise there needs to be a pre-created PV that can satisfy the PVC.

volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data-persistent-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
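
For the dynamic-provisioning route, the cluster needs a StorageClass whose provisioner creates the disk on demand. A minimal sketch for AKS (the class name, provisioner and parameters are illustrative assumptions, chosen to match the claim above):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard                       # matches the storageClassName in the claim above
provisioner: disk.csi.azure.com        # Azure disk CSI driver provisions a managed disk per bound PVC
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer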

Reference: pod has unbound immediate PersistentVolumeClaims (repeated 3 times)

Follow this GitHub discussion as well: https://github.com/hashicorp/consul-helm/issues/237

RahulKumarShaw