
I wanted to test a pretrained model downloaded from here to perform an OCR task; its name is CRNN_VGG_BiLSTM_CTC.onnx. This model is extracted from here. The sample-image.png can be downloaded from here (see the code below).

When I run the network's forward pass on the blob to perform OCR, I get the following error:

error: OpenCV(4.4.0) /tmp/pip-req-build-xgme2194/opencv/modules/dnn/src/layers/convolution_layer.cpp:348: error: (-215:Assertion failed) ngroups > 0 && inpCn % ngroups == 0 && outCn % ngroups == 0 in function 'getMemoryShapes'

The code is below. I have tried many things, and it's strange because I thought this model did not require a predetermined input shape. If you know any other way to read this model and run the forward pass, that would also be helpful, but I'd rather solve it using OpenCV.

import cv2 as cv
import os

# The model is downloaded from here https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr
# model path
MODELS_PATH = './'
modelRecognition = os.path.join(MODELS_PATH, 'CRNN_VGG_BiLSTM_CTC.onnx')
# read net 
recognizer = cv.dnn.readNetFromONNX(modelRecognition)

# Download sample_image.png from https://i.ibb.co/fMmCB7J/sample-image.png  (image host website)
sample_image = cv.imread('sample-image.png')
# Height , Width and number of channels of the image
H, W, C = sample_image.shape

# Create a 4D blob from cropped image
blob = cv.dnn.blobFromImage(sample_image, size = (H, W))

recognizer.setInput(blob)

# Here is where I get the error mentioned above
result = recognizer.forward()

Thank you so much in advance.


1 Answer


Your problem is actually that the input data you feed to your model doesn't match the shape of the data the model was trained on.

I used this answer to inspect your ONNX model, and it appears that it expects an input of shape (1, 1, 32, 100). I modified your code to reshape the image to 1 x 32 x 100 pixels, and the inference runs without error.
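In case it's useful, here is a minimal sketch of one way to do that inspection yourself, assuming the onnx Python package is installed (pip install onnx); this is just an illustration, not necessarily the exact method from the linked answer.

import onnx

model = onnx.load('CRNN_VGG_BiLSTM_CTC.onnx')
for inp in model.graph.input:
    # each dimension is either a fixed value (dim_value) or a symbolic name (dim_param)
    dims = [d.dim_value if d.HasField('dim_value') else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # should print a 4D shape like [1, 1, 32, 100]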

EDIT

I've added some code to interpret the result of the inference: we now display the image and the inferred OCR text. The output doesn't look right yet, but according to the OpenCV tutorial, there should be two models:

  1. one that detects where there is text in the image. This network accepts images of various sizes; it returns the locations of text within the image, and cropped parts of the image, of size 100x32, are then passed to the second one;
  2. one that actually does the "reading": given patches of the image, it returns the characters. For this, a file alphabet_36.txt is provided together with the pre-trained models.

It isn't clear to me, though, which network to use for text detection; a sketch of that stage follows right below, and after it comes the edited recognition code, which I hope helps you develop your application further.
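The snippet below is an untested outline of the detection stage, assuming the EAST detector (frozen_east_text_detection.pb) used by OpenCV's text_detection.py sample; the output layer names and mean values are the ones that sample uses, and scene.jpg is a placeholder for a full-scene image.

import cv2 as cv

# EAST text detector, file name as in the OpenCV sample (assumption)
detector = cv.dnn.readNet('frozen_east_text_detection.pb')

scene = cv.imread('scene.jpg')  # placeholder: any full-scene image
# EAST expects input dimensions that are multiples of 32
blob = cv.dnn.blobFromImage(scene, 1.0, (320, 320),
                            (123.68, 116.78, 103.94), swapRB=True, crop=False)
detector.setInput(blob)
scores, geometry = detector.forward(['feature_fusion/Conv_7/Sigmoid',
                                     'feature_fusion/concat_3'])
# scores holds text/no-text confidences and geometry the box parameters;
# decoding them into rotated rectangles plus non-maximum suppression is done
# as in the sample, and each crop, resized to 100x32, goes to the recognizer.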

import cv2 as cv
import os
import numpy as np
import matplotlib.pyplot as plt
# The model is downloaded from here https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr
# model path 
MODELS_PATH = './'
modelRecognition = os.path.join(MODELS_PATH,'CRNN_VGG_BiLSTM_CTC.onnx')

# read net 
recognizer = cv.dnn.readNetFromONNX(modelRecognition)

# Download sample_image.png from https://i.ibb.co/fMmCB7J/sample-image.png  (image host website)
sample_image = cv.imread('sample-image.png', cv.IMREAD_GRAYSCALE)
# The model expects a 1 x 1 x 32 x 100 input, i.e. a single-channel image
# of height 32 and width 100; note that cv.resize takes (width, height)
sample_image = cv.resize(sample_image, (100, 32))

# Create a 4D blob from the image; blobFromImage also takes size as (width, height)
blob = cv.dnn.blobFromImage(sample_image, size=(100, 32))
recognizer.setInput(blob)

# network inference
result = recognizer.forward()

# load alphabet
with open('alphabet_36.txt') as f:
    alphabet = [line.strip() for line in f]

# interpret inference results: the output has one row per timestep;
# take the most likely class at each step
res = []
for i in range(result.shape[0]):
    ind = np.argmax(result[i, 0])
    res.append(alphabet[ind])
ocrtxt = ''.join(res)

# show image and detected OCR characters
plt.imshow(sample_image)
plt.title(ocrtxt)
plt.show()
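One caveat on the decoding: since the model name ends in CTC, the raw per-timestep argmax above most likely needs greedy CTC decoding, i.e. collapsing repeated symbols and dropping the blank token. The sketch below assumes, as OpenCV's text_detection.py sample does, that class 0 is the blank and class i maps to alphabet[i - 1]; I haven't verified this against the model itself.

def ctc_greedy_decode(result, alphabet):
    # result: network output of shape (timesteps, 1, num_classes)
    text = ''
    prev = -1  # class index emitted at the previous timestep
    for t in range(result.shape[0]):
        ind = int(np.argmax(result[t, 0]))
        # emit a character only if it is neither the blank nor a repeat
        if ind > 0 and ind != prev:
            text += alphabet[ind - 1]
        prev = ind
    return text

print(ctc_greedy_decode(result, alphabet))

The comments below also point out that rescaling the input helps a lot; with blobFromImage that would correspond to scalefactor=1.0/127.5 and mean=127.5.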

Hope it helps. Cheers

  • Hello, Christian, thank you so much for your answer. I was pretty confident that the architecture of this net didn't require a fixed input size (because of the convolution operations inside it), but I wasn't right. Such a mistake! You definitely solved the problem I posted and therefore deserve the answer marked as correct. However, this net is an OCR model; do you have any idea why resizing to 100x32 would make sense? I thought this OCR model would work on words. But maybe this model works just on characters and needs a previous segmentation model? – Tom Jan 08 '21 at 17:49
  • If you'd like to share your thoughts about this, that'd be super helpful. Thank you so much in advance! – Tom Jan 08 '21 at 17:49
  • Hello Tom, you're welcome! Well, I'm not familiar with this particular network (do you have its architecture somewhere, from a paper or so?) but if it has fully connected layers after the convolutional layers, then the network actually depends on the shape of the input. – Christian Jan 08 '21 at 18:03
  • Regarding whether the resizing makes sense, it is a good question. It makes sense if the characters are approximately the same size *after resizing* as the ones seen at training. If not, the network will probably struggle to return the correct characters. – Christian Jan 08 '21 at 18:05
  • Hi Tom, see my edits in my answer, I hope it helps you further – Christian Jan 09 '21 at 13:20
  • Hi, Christian, yeah, I got confused by the fact that it says "VGG" network, so I thought it used a VGG network at the initial point of the architecture and therefore didn't rely on input shape, but it's clear that it does depend on input shape, so forget those words haha. No, I didn't find a paper for this yet, unfortunately. – Tom Jan 10 '21 at 14:26
  • Finally, I took a look at this implementation for webcams https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.py and realized that I had not rescaled the image, and that was making the performance of the model so poor. If you rescale the image with mean=127.5 and std=127.5, you'll start to get decent results; I also used character-level recognition. – Tom Jan 10 '21 at 14:27
  • I didn't think of rescaling the data, well done! – Christian Jan 10 '21 at 17:37