
I want to detect stretches of bold (and perhaps italic) text in images of pages--think TIFFs, or image PDFs. I need pointers to any open source software that does that.

Here's a picture of a dictionary entry (from a Tzeltal--Spanish dictionary) illustrating such text:

[Image: scanned dictionary entry showing bold, italic, and normal text]

The first line has bold, then italics, then "normal" text; the second has a couple of words in bold, then a couple in a normal font. The formatting represents implicit structure: bold is for headwords, italics for parts of speech, and normal for most other things. Without knowing what's bold/italic/normal, it's impossible to parse these entries into structured text (like XML).

When our dictionary parsing project was active several years ago, we were using Tesseract version 3 to OCR the images, with the hOCR output giving us positional information on the page (crucial to, e.g., separating out different entries in the dictionary). The hOCR output also included 'strong' tags for bold and 'em' tags for italics. While the 'em' tagging was reasonably accurate, the 'strong' tagging was almost random. And now version 4 of Tesseract doesn't even try (see also). You can still tell Tesseract to use the old engine, but as I say, that seems to be completely inaccurate, at least on the text we fed it.
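For reference, the bounding boxes and style tags can be pulled out of the hOCR with nothing but the standard library. This is a sketch, not production code; the hOCR snippet below is made up, but the `ocrx_word` class and the `bbox` property in the `title` attribute are the conventions Tesseract emits:

```python
from html.parser import HTMLParser
import re

class HOCRWords(HTMLParser):
    """Collect (text, bbox, styles) for each ocrx_word span in hOCR output."""
    def __init__(self):
        super().__init__()
        self.words = []       # finished words
        self._bbox = None     # bbox of the word span we are inside, if any
        self._styles = []     # 'strong'/'em' tags seen inside that span
        self._text = []       # text fragments inside that span
        self._depth = 0       # tag nesting depth inside that span

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and a.get("class") == "ocrx_word":
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
            self._bbox = tuple(map(int, m.groups())) if m else None
            self._styles, self._text, self._depth = [], [], 1
        elif self._bbox is not None:
            if tag in ("strong", "em"):
                self._styles.append(tag)
            self._depth += 1

    def handle_endtag(self, tag):
        if self._bbox is not None:
            self._depth -= 1
            if self._depth == 0:   # the word span itself just closed
                self.words.append(
                    ("".join(self._text).strip(), self._bbox, self._styles))
                self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

# Made-up hOCR fragment in the shape Tesseract 3 produced:
sample = (
    '<span class="ocrx_word" title="bbox 10 5 60 20"><strong>head</strong></span> '
    '<span class="ocrx_word" title="bbox 65 5 90 20"><em>adj</em></span> '
    '<span class="ocrx_word" title="bbox 95 5 140 20">gloss</span>'
)
parser = HOCRWords()
parser.feed(sample)
# parser.words now holds (text, (x0, y0, x1, y1), styles) triples
```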

It doesn't seem like distinguishing bold vs. normal text should be hard; I can stand far away from my monitor and pick out the bold and non-bold stretches even though I can't read the words at that distance. (I suppose telling whether an entire text was bold or non-bold would be harder, but distinguishing them when both appear seems easy--for humans.)

I am told that ABBYY FineReader outputs information on font style, but for various reasons that won't work for our application.

If there were a non-OCR way of distinguishing bold vs. non-bold text that would put bounding boxes around the bold text, we could probably match those stretches up with the bounding boxes for characters/words that Tesseract outputs (allowing for a few pixels difference). I know there was research on this decades ago (also here), but is there any open source software that actually does it?
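A sketch of what that matching-up step could look like; the (x0, y0, x1, y1) box format, the tolerance value, and both function names here are illustrative assumptions, not anything Tesseract provides:

```python
def boxes_match(box_a, box_b, tol=3):
    # Boxes are (x0, y0, x1, y1); match if every edge agrees within tol px.
    return all(abs(a - b) <= tol for a, b in zip(box_a, box_b))

def tag_bold_words(tesseract_words, bold_boxes, tol=3):
    # Pair each Tesseract (text, box) with any detector-produced bold box.
    return [(text, box, any(boxes_match(box, b, tol) for b in bold_boxes))
            for text, box in tesseract_words]

words = [("headword", (10, 5, 60, 20)), ("gloss", (95, 5, 140, 20))]
bold_boxes = [(12, 6, 58, 21)]   # invented detector output
tagged = tag_bold_words(words, bold_boxes)
# tagged marks "headword" as bold and "gloss" as not
```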

Mike Maxwell

1 Answer


Here is a script I came up with:

import cv2
import numpy as np

# Upright rectangular kernel, used to dilate the eroded image back to
# roughly its original stroke thickness.
KERNEL = np.asarray([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
], np.uint8)

# Slanted kernel: the ON pixels shift left as you move down, matching
# the slant of italic strokes.
KERNEL_ITALIC = np.asarray([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
], np.uint8)

def pre_process_italic(img):
    # Horizontally flipped copy: the same slanted kernel then matches
    # strokes slanted the opposite way.
    img_f = cv2.flip(img, 1)

    # Erode with the slanted kernel (only strokes with a matching slant
    # survive), then dilate to restore their size.
    img = cv2.erode(img, KERNEL_ITALIC, iterations=1)
    img = cv2.dilate(img, KERNEL, iterations=1)

    img_f = cv2.erode(img_f, KERNEL_ITALIC, iterations=1)
    img_f = cv2.dilate(img_f, KERNEL, iterations=1)
    img_f = cv2.flip(img_f, 1)  # flip back so coordinates line up
    return img, img_f

def apply_func_italic(bbox, original, preprocessed):
    # bbox = (x0, y0, x1, y1); compare the mean pixel value inside the
    # box before and after the morphological filtering.
    x0, y0, x1, y1 = bbox
    a = np.mean(original[y0:y1, x0:x1])
    b = np.mean(preprocessed[y0:y1, x0:x1])
    return get_ratio(a, b)

def get_ratio(a, b):
    # Normalized difference in [-2, 2]: large values mean the filtering
    # removed most of the box's pixels.
    return ((a - b) / (a + b + 1e-8)) * 2
This Python code takes an image containing text and applies a few OpenCV morphological operations. pre_process_italic returns two processed images, one from the original orientation and one from the flipped (and flipped-back) copy. After that, all you need is the words' bounding boxes: loop through them and calculate the ratio of 'ON' pixels in the original image to those in the processed one. get_ratio is just one possible metric and could be replaced by another; I have not found a better one yet.
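To make the metric concrete, here is a self-contained demo on synthetic strokes (pure NumPy; the erode helper is a stand-in for cv2.erode, and the two stroke images are invented for illustration):

```python
import numpy as np

# Same slanted kernel as in the answer.
KERNEL_ITALIC = np.asarray([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
], np.uint8)

def erode(img, kernel):
    # Minimal binary erosion, a pure-NumPy stand-in for cv2.erode:
    # a pixel survives only if the whole kernel fits under it.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            if np.all(img[i:i + kh, j:j + kw][kernel == 1] == 1):
                out[i, j] = 1
    return out

def get_ratio(a, b):
    return ((a - b) / (a + b + 1e-8)) * 2

# Two synthetic "strokes" with the same number of ON pixels (18 each):
# one slanted like an italic stem, one upright.
slanted = np.zeros((11, 6), np.uint8)
slanted[1:10, 1:5] = KERNEL_ITALIC
upright = np.zeros((11, 6), np.uint8)
upright[1:10, 2:4] = 1

ratio_slanted = get_ratio(slanted.mean(), erode(slanted, KERNEL_ITALIC).mean())
ratio_upright = get_ratio(upright.mean(), erode(upright, KERNEL_ITALIC).mean())
# The upright stroke is erased completely, so its ratio is at the
# maximum (~2.0); part of the slanted stroke survives, giving a lower
# ratio. A low ratio is therefore evidence of italic slant.
```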

andkot
  • So, anyway, the main idea is to exploit the fact that italic style has some asymmetry about the vertical axis; that is why I apply KERNEL_ITALIC to both the original image and its horizontally flipped copy – andkot Jul 03 '21 at 21:03
  • Thanks! I guess this is a partial answer, namely how to prevent italics from messing up the bold vs. non-bold metric. FWIW, I have a small team working on the larger problem--detecting bold text--and they've had some success using the erode() function, which seems to widen the ratio between bold and non-bold in its output. Without it, the pixel ratio is not very accurate; e.g., words with letters like 'i' and 'l' inherently have fewer black pixels than words like 'M' and 'W' (even in variable-width fonts). – Mike Maxwell Jul 06 '21 at 21:50
  • I think you need to calculate the bold metric over the whole word. I use this algorithm to detect bold words and get very good accuracy @MikeMaxwell – andkot Jul 20 '21 at 14:53
  • @andkot can you tell how? – Arnav Mehta Sep 23 '22 at 13:11
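Picking up Mike Maxwell's comment about erode() helping with bold detection, here is a minimal pure-NumPy sketch of that idea (the erode helper stands in for cv2.erode; the synthetic strokes and kernel size are invented for illustration): a 3x3 erosion wipes out thin, normal-weight strokes while thick, bold strokes survive, and using the *fraction* of surviving pixels normalizes away the 'i'/'l' vs 'M'/'W' pixel-count problem:

```python
import numpy as np

def erode(img, kernel):
    # Minimal binary erosion, a pure-NumPy stand-in for cv2.erode.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            if np.all(img[i:i + kh, j:j + kw][kernel == 1] == 1):
                out[i, j] = 1
    return out

def survival_ratio(word_img, kernel=None):
    # Fraction of ON pixels that survive erosion; thick (bold) strokes
    # score well above zero, thin strokes are wiped out entirely.
    if kernel is None:
        kernel = np.ones((3, 3), np.uint8)
    on = word_img.sum()
    return erode(word_img, kernel).sum() / on if on else 0.0

# Synthetic strokes: a 5-px-wide "bold" stem vs a 2-px-wide "normal" one.
bold = np.zeros((20, 12), np.uint8)
bold[2:18, 3:8] = 1
normal = np.zeros((20, 12), np.uint8)
normal[2:18, 5:7] = 1
# survival_ratio(bold) is well above zero; survival_ratio(normal) is 0.0
```

On real scans you would compute this per word box and pick a threshold empirically; the right kernel size depends on scan resolution and font weight.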