TL;DR: How can I select a paragraph in an image so that the selection doesn't include the adjacent paragraphs (above and below)?
I have a set of scanned images, each a single column of text, such as this one. The images are all black and white, already rotated, noise-reduced, and trimmed of surrounding white space.
What I want to do is divide each of these images into paragraphs. My initial idea was to measure the average brightness of each pixel row to find the spaces between lines of text, then select a rectangle starting from that line and sized to match the indentation, and measure the brightness of that rectangle. But that seems a tad cumbersome.
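To make the row-brightness idea concrete, here is a minimal sketch in plain Python. It assumes the image is already loaded as a list of pixel rows with values from 0 (black) to 255 (white); the names `find_gaps` and `starts_paragraph` and the parameters `white_threshold` and `indent_width` are hypothetical and would need tuning for real scans.

```python
def row_means(image):
    """Average brightness of every pixel row."""
    return [sum(row) / len(row) for row in image]


def find_gaps(image, white_threshold=250):
    """Return (start, end) row ranges, end exclusive, that are blank.

    A row counts as blank when its mean brightness is at least
    white_threshold (an assumed value, not a measured one).
    """
    gaps = []
    start = None
    for y, mean in enumerate(row_means(image)):
        if mean >= white_threshold:
            if start is None:
                start = y          # a blank band begins here
        elif start is not None:
            gaps.append((start, y))  # the blank band ended on the previous row
            start = None
    if start is not None:
        gaps.append((start, len(image)))
    return gaps


def starts_paragraph(image, top, bottom, indent_width, white_threshold=250):
    """True if the rectangle covering rows top..bottom-1 and the first
    indent_width columns is blank, i.e. the text line looks indented."""
    pixels = [p for row in image[top:bottom] for p in row[:indent_width]]
    return sum(pixels) / len(pixels) >= white_threshold
```

The text line between two consecutive gaps would be passed to `starts_paragraph` as `top` and `bottom`. This only works while the gaps between lines contain fully blank rows, so skewed lines can make a gap disappear.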
Moreover, the lines are sometimes slightly skewed (up to ≈ 10 px vertical difference between the two ends), so adjacent lines sometimes overlap. So I thought of selecting all the letters of a paragraph and using them to plot a block of text. I got this using this method, but I am not sure how to proceed further. Should I select each letter rectangle starting `n` pixels from the left, and try to include every rectangle starting no less than `first_rectangle_x - offset`? But what then?
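For illustration, here is a rough sketch of one way the letter rectangles could be grouped. It is not a tested solution: it assumes each letter box is an `(x, y, w, h)` tuple, that boxes whose vertical centres lie within `tolerance` pixels belong to the same line, and that a line starting at least `indent` pixels right of the left margin opens a new paragraph. All function and parameter names are hypothetical.

```python
def group_into_lines(boxes, tolerance=10):
    """Cluster (x, y, w, h) letter boxes into text lines by vertical centre."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        centre = box[1] + box[3] / 2
        if lines and abs(centre - lines[-1]["centre"]) <= tolerance:
            line = lines[-1]
            line["boxes"].append(box)
            # Keep a running mean of the line's centre to absorb slight skew.
            line["centre"] += (centre - line["centre"]) / len(line["boxes"])
        else:
            lines.append({"centre": centre, "boxes": [box]})
    # Sort each line's boxes left to right.
    return [sorted(line["boxes"]) for line in lines]


def bounding_rect(boxes):
    """Smallest (x, y, w, h) rectangle containing all the given boxes."""
    left = min(x for x, y, w, h in boxes)
    top = min(y for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    bottom = max(y + h for x, y, w, h in boxes)
    return (left, top, right - left, bottom - top)


def split_paragraphs(boxes, indent=15, tolerance=10):
    """Return one bounding rectangle per paragraph, top to bottom."""
    lines = group_into_lines(boxes, tolerance)
    if not lines:
        return []
    margin = min(line[0][0] for line in lines)  # leftmost letter overall
    paragraphs = []
    for line in lines:
        # An indented first letter (or the very first line) opens a paragraph.
        if not paragraphs or line[0][0] - margin >= indent:
            paragraphs.append([])
        paragraphs[-1].extend(line)
    return [bounding_rect(p) for p in paragraphs]
```

Because each paragraph is defined by its set of letter boxes rather than by a horizontal cut, the bounding rectangles of neighbouring paragraphs can still overlap by a few pixels when lines are skewed. Masking with the letter boxes themselves would avoid picking up neighbouring text.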