I have converted a RoBERTa PyTorch model to ONNX and quantized it. I am able to get scores from the ONNX model for a single input data point (one sentence at a time). I want to understand how to get batch predictions using an ONNX Runtime inference session by passing multiple inputs to the session. Below is the example scenario.
Model: roberta-quant.onnx, a quantized ONNX version of the RoBERTa PyTorch model
Code used to convert RoBERTa to ONNX:
torch.onnx.export(model,
                  args=tuple(inputs.values()),  # model input
                  f=export_model_path,          # where to save the model
                  opset_version=11,             # the ONNX opset version to export to
                  do_constant_folding=True,     # whether to execute constant folding for optimization
                  input_names=['input_ids',     # the model's input names
                               'attention_mask'],
                  output_names=['output_0'],    # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,  # variable-length axes
                                'attention_mask': symbolic_names,
                                'output_0': {0: 'batch_size'}})
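For reference, `symbolic_names` is not defined in the snippet above; I am assuming it follows the common transformers export recipe, marking both the batch and sequence axes as dynamic (the dynamic batch axis is what should make batched inference possible):

```python
# Assumed definition of `symbolic_names` (not shown in the export snippet):
# axis 0 is the batch dimension, axis 1 the variable sequence length.
symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
```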
Input sample passed to the ONNX Runtime inference session:
{
    'input_ids': array([[ 0, 510, 35, 21071, ....., 1, 1, 1, 1, 1, 1]]),
    'attention_mask': array([[1, 1, 1, 1, ......., 0, 0, 0, 0, 0, 0]])
}
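For concreteness, each single-sample input is a 2-D array with a leading batch dimension of 1. A minimal sketch with dummy token ids (the real values come from the tokenizer; the pad id of 1 is an assumption matching the padding shown above):

```python
import numpy as np

max_seq_length = 128  # matches the setting used in my inference code below

# Dummy token ids: a few real-looking ids followed by padding (pad id assumed to be 1)
input_ids = np.full((1, max_seq_length), 1, dtype=np.int64)
input_ids[0, :4] = [0, 510, 35, 21071]

# Attention mask: 1 over real tokens, 0 over padding
attention_mask = np.zeros((1, max_seq_length), dtype=np.int64)
attention_mask[0, :4] = 1

ort_inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}
```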
Running the ONNX model for 400 data samples (sentences) using an ONNX Runtime inference session:
session = onnxruntime.InferenceSession("roberta_quantized.onnx", providers=['CPUExecutionProvider'])
for i in range(400):
    ort_inputs = {
        'input_ids': input_ids[i].cpu().reshape(1, max_seq_length).numpy(),  # max_seq_length=128 here
        'attention_mask': attention_masks[i].cpu().reshape(1, max_seq_length).numpy()  # key must match the exported input name
    }
    ort_outputs = session.run(None, ort_inputs)
In the above code I am looping through the 400 sentences sequentially to get the scores in ort_outputs. Please help me understand how I can perform batch processing here with the ONNX model, i.e. send the input_ids and attention_masks for multiple sentences in a single call and get the scores for all sentences back in ort_outputs.
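Here is what I imagine batching would look like (a sketch, not verified): since the export marks axis 0 as a dynamic batch_size, I assume I could stack all 400 per-sentence arrays along a new leading axis and make one session.run call. The arrays below are dummies standing in for my real input_ids and attention_masks:

```python
import numpy as np

max_seq_length = 128
num_sentences = 400

# Dummy per-sentence 1-D tensors standing in for my real inputs
input_ids = [np.ones(max_seq_length, dtype=np.int64) for _ in range(num_sentences)]
attention_masks = [np.ones(max_seq_length, dtype=np.int64) for _ in range(num_sentences)]

# Stack along a new leading batch axis -> shape (400, 128)
batch_input_ids = np.stack(input_ids, axis=0)
batch_attention_mask = np.stack(attention_masks, axis=0)

# Is a single call like this the right way to score all sentences at once?
# ort_outputs = session.run(None, {'input_ids': batch_input_ids,
#                                  'attention_mask': batch_attention_mask})
```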
Thanks in advance!