I have converted a RoBERTa PyTorch model to ONNX and quantized it. I am able to get scores from the ONNX model for a single input data point (one sentence at a time). I want to understand how to get batch predictions using an ONNX Runtime inference session by passing multiple inputs to the session. Below is the example scenario.
Model: roberta-quant.onnx, a quantized ONNX version of the RoBERTa PyTorch model
Code used to convert RoBERTa to ONNX:
torch.onnx.export(model,
                  args=tuple(inputs.values()),  # model input
                  f=export_model_path,          # where to save the model
                  opset_version=11,             # the ONNX opset version to export to
                  do_constant_folding=True,     # whether to execute constant folding for optimization
                  input_names=['input_ids',     # the model's input names
                               'attention_mask'],
                  output_names=['output_0'],    # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,  # variable-length axes
                                'attention_mask': symbolic_names,
                                'output_0': {0: 'batch_size'}})
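For reference, `symbolic_names` is not defined in the snippet above; I am assuming it follows the common transformers export recipe, marking both the batch and sequence axes as dynamic (the dynamic batch axis is what should make batched inference possible):

```python
# Assumed definition of `symbolic_names` (not shown in the export snippet):
# axis 0 is the batch dimension, axis 1 the variable sequence length.
symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
```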
Input sample passed to the ONNX Runtime inference session:
{
    'input_ids': array([[ 0, 510, 35, 21071, ....., 1, 1, 1, 1, 1, 1]]),
    'attention_mask': array([[1, 1, 1, 1, ......., 0, 0, 0, 0, 0, 0]])
}
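For concreteness, each single-sample input is a 2-D array with a leading batch dimension of 1. A minimal sketch with dummy token ids (the real values come from the tokenizer; the pad id of 1 is an assumption matching the padding shown above):

```python
import numpy as np

max_seq_length = 128  # matches the setting used in my inference code below

# Dummy token ids: a few real-looking ids followed by padding (pad id assumed to be 1)
input_ids = np.full((1, max_seq_length), 1, dtype=np.int64)
input_ids[0, :4] = [0, 510, 35, 21071]

# Attention mask: 1 over real tokens, 0 over padding
attention_mask = np.zeros((1, max_seq_length), dtype=np.int64)
attention_mask[0, :4] = 1

ort_inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}
```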
Running the ONNX model for 400 data samples (sentences) using an ONNX Runtime inference session:
session = onnxruntime.InferenceSession("roberta_quantized.onnx", providers=['CPUExecutionProvider'])
for i in range(400):
    ort_inputs = {
        'input_ids': input_ids[i].cpu().reshape(1, max_seq_length).numpy(),  # max_seq_length=128 here
        'attention_mask': attention_masks[i].cpu().reshape(1, max_seq_length).numpy()  # key must match the exported input name
    }
    ort_outputs = session.run(None, ort_inputs)
In the above code I am looping through the 400 sentences sequentially to get the scores in ort_outputs. Please help me understand how I can perform batch processing here with the ONNX model, i.e. send the input_ids and attention_masks for multiple sentences in a single call and get the scores for all sentences back in ort_outputs.
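Here is what I imagine batching would look like (a sketch, not verified): since the export marks axis 0 as a dynamic batch_size, I assume I could stack all 400 per-sentence arrays along a new leading axis and make one session.run call. The arrays below are dummies standing in for my real input_ids and attention_masks:

```python
import numpy as np

max_seq_length = 128
num_sentences = 400

# Dummy per-sentence 1-D tensors standing in for my real inputs
input_ids = [np.ones(max_seq_length, dtype=np.int64) for _ in range(num_sentences)]
attention_masks = [np.ones(max_seq_length, dtype=np.int64) for _ in range(num_sentences)]

# Stack along a new leading batch axis -> shape (400, 128)
batch_input_ids = np.stack(input_ids, axis=0)
batch_attention_mask = np.stack(attention_masks, axis=0)

# Is a single call like this the right way to score all sentences at once?
# ort_outputs = session.run(None, {'input_ids': batch_input_ids,
#                                  'attention_mask': batch_attention_mask})
```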
Thanks in advance!