
so I am using OpenCV to do template matching, as shown below. I constantly need to fiddle with the visual similarity #THRESHOLD, because sometimes it fails to discover matches, and sometimes it returns way too many matches. It's trial and error until it matches exactly one element at a position in a document. I'm wondering if there is any way to automate this.

The image.png file is a picture of a PDF document, and the template.png file is a picture of a paragraph. My goal is to discover all the paragraphs in the PDF document, and I want to know what neural network would be useful here.

import cv2
import numpy as np


img = cv2.imread("image.png")
gimg = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)
w, h = template.shape[::-1]


result = cv2.matchTemplate(gimg, template, cv2.TM_CCOEFF_NORMED)

loc = np.where(result >= 0.36) #THRESHOLD
print(loc)

for pt in zip(*loc[::-1]):
    cv2.rectangle(img, pt, (pt[0] + w, pt[1] + h), (0, 255, 0), 3)

cv2.imwrite("output.png", img)

So, for instance, it would try every #THRESHOLD value from 0 to 1.0 and return the threshold value that yields exactly one rectangle match (the green box drawn above) in the image.

However, I can't help but feel this is very exhaustive. Is there a smarter way to find the right threshold value?
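For reference, the exhaustive scan I have in mind could be sketched like this (a hypothetical helper, not tested on real documents; it scans from high to low so the first hit is the strictest threshold, and it ignores overlapping near-duplicate hits):

```python
import numpy as np

def find_single_match_threshold(result, steps=100):
    # Hypothetical helper: scan candidate thresholds from 1.0 down to 0.0
    # and return the first one that yields exactly one matching position.
    for t in np.linspace(1.0, 0.0, steps):
        ys, xs = np.where(result >= t)
        if len(xs) == 1:
            return t, (int(xs[0]), int(ys[0]))
    return None

# Toy stand-in for a cv2.matchTemplate result map
result = np.array([[0.10, 0.50, 0.20],
                   [0.30, 0.90, 0.40]], dtype=np.float32)
t, pt = find_single_match_threshold(result)
```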

user299709
  • If you are just trying to extract all paragraphs, each separated, then perhaps use a morphology kernel to blend the text into lines or paragraphs of black rectangles, then use contours to find each paragraph. Search this forum as I have seen just such examples before. – fmw42 Jan 26 '20 at 22:03
  • can you elaborate with known libraries? – user299709 Jan 26 '20 at 22:05
  • Is your template taken from the original image? Please post both the image and the template, so we can help you better – J.D. Jan 26 '20 at 22:54
  • See https://stackoverflow.com/questions/59923076/how-to-automatically-adjust-the-threshold-for-template-matching-with-opencv?noredirect=1#comment105969511_59923076 and https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html and https://answers.opencv.org/question/27411/use-opencv-to-detect-text-blocks-send-to-tesseract-ios/ and https://stackoverflow.com/questions/51436896/extracting-text-opencv-contours/51443493 – fmw42 Jan 26 '20 at 22:56
  • you could use minMaxLoc to get the minimum/maximum value (the best found position). And if you still need multiple detections, choose a threshold according to the minimum/maximum. – Micka Jan 26 '20 at 23:07
  • @Micka can you show me with some code? Where do I set minMaxLoc? – user299709 Jan 26 '20 at 23:37
  • minMaxLoc is an OpenCV function and is, afaik, part of the official template matching code example. – Micka Jan 27 '20 at 05:20
  • Template Matching is **NOT** the way to go about this, as you probably noticed by now. The other users have shared a ton of very interesting resources on how to do text detection. Consider adding the samples that @nathancy suggested if you want better answers. – karlphillip Jan 31 '20 at 10:21
  • @user299709 If you don't share sample images, this thread will die. There's nothing left to talk about. – karlphillip Feb 02 '20 at 11:05

2 Answers


Since there were lots of comments and hardly any responses, I will summarize the answers for future readers.

First off, your question is almost identical to How to detect paragraphs in a text document image for a non-consistent text structure in Python. Also this thread seems to address the problem you are tackling: Easy ways to detect and crop blocks (paragraphs) of text out of image?

Second, detecting paragraphs in a PDF should not be done with template matching but with one of the following approaches:

  1. Using the Canny edge detector in combination with dilation and F1 score optimization. This is often used for OCR, as suggested by fmw42.
  2. Alternatively, you could use the Stroke Width Transform (SWT) to identify text, which you then group into lines and finally into blocks, i.e. paragraphs. For OCR, these blocks can then be passed to Tesseract (as suggested by fmw42).

The key in any OCR task is to simplify the text detection problem as much as possible by removing disruptive features, altering the image as needed. The more you know about the image you are processing beforehand, the better: change colors, binarize, threshold, dilate, apply filters, etc.

To answer your question on finding the best match in template matching: check out nathancy's answer on template matching. In essence, it comes down to finding the maximum correlation value using minMaxLoc. See this excerpt from nathancy's answer:

    # Threshold resized image and apply template matching
    thresh = cv2.threshold(resized, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    detected = cv2.matchTemplate(thresh, template, cv2.TM_CCOEFF)
    (_, max_val, _, max_loc) = cv2.minMaxLoc(detected)

Also, a comprehensive guide extracting text blocks from an image (without using template matching) can be found in nathancy's answer in this thread.

avgJoe

I would just have changed the threshold line to

loc = np.where(result == np.max(result))

This gives me the best matching positions, and then I can choose only one if I want to...
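A tiny numpy-only illustration of what this returns (the array is a toy stand-in for a `matchTemplate` result map):

```python
import numpy as np

# Toy correlation map standing in for cv2.matchTemplate output
result = np.array([[0.1, 0.5, 0.2],
                   [0.3, 0.9, 0.4]], dtype=np.float32)

# Keep only the position(s) holding the maximum correlation score
loc = np.where(result == np.max(result))
points = list(zip(*loc[::-1]))  # (x, y) pairs, same convention as the question's loop
```

Note that if several positions tie for the exact maximum, all of them are returned, so you may still need to pick one.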

thesylio