How to recognize Text-Presence pattern in a scanned image and crop it?

Question

Smart Cropping for Scanned Docs

Recently I took over a preservation project of old books/manuscripts. They are huge in quantity, almost 10,000 pages. I had to scan them manually with a portable scanner as they were not in a condition to be scanned in an automated book scanner.

The real problem shows up when I start editing them in Photoshop. Note that all of them are basically documents (in JPG format) and that there are absolutely no images in those documents. They are in a different language (Oriya), for which I am sure there won't be any OCR software available in the near future. (If there is, please let me know.)

To make those images (docs) look clean and elegant, I have to crop them, position them, increase the contrast a bit, clean up unnecessary spots with the eraser, et cetera. I was able to automate most of these steps in Photoshop, but cropping is where I am stuck. I can't automate cropping because the software can't recognize the presence of text or content in a given area of the image (doc); it just applies the fixed crop values given to it.

I want a way to automate this cropping process. I have an idea for it, but I don't know if it's practical to implement, and as far as I know there is no software on the market that does this kind of thing.

The possible solution: this might work if a tool can recognize the presence of text in an image (this need not be very sophisticated, as all of them are normal document images: no pictures in them, no patterns, just plain rectangles of text) and crop right at the border of that text on each side, so that it outputs a document image without any margin. After this, the rest of the tasks can be automated in Photoshop, such as adding white space for margins and tweaking the contrast and color to make the page more readable.

Here is an album link to the gallery. I can post more sample images if it would be useful - just let me know.

http://imageshack.us/g/1/9800204/

Here is one example from the bigger sample of images available through the link above:

one example of a bigger set...

It won't be possible to come up with a solution without any idea of just **how** your JPEG scans look. Can you please provide (a link to) a sample of 3-4 pages of your scans? (I might be able to come up with an ImageMagick-based solution....) — Kurt Pfeifle, Oct 07 '12 at 08:46
I don't see the links. Expect better quality answers if you post several photos that approximate the range of variation you expect to see. That said, I'll give it a shot. — Rethunk, Oct 08 '12 at 03:05
The links are only available if one logs in. I'm not going to register at Imageshack just for this in order to get access to the links. I'm the one who provided f.e. [this answer](http://stackoverflow.com/a/11987620/359307) (just so you know what level of quality in answers you may be missing if you make it hard to get access to your pictures). — Kurt Pfeifle, Oct 08 '12 at 05:50
I made that album public and also tested the link. I don't know why it didn't work. Also check out this tinypic image. /*Never knew that image sharing could be this confusing.*/ — Dave, Oct 08 '12 at 06:24
ImageShack wants me to register in order to access the list of direct links. Why can't you just post these direct links, eh? — Kurt Pfeifle, Oct 08 '12 at 19:43
Sorry for that album prob. This is the img link. http://i46.tinypic.com/2epik4i.jpg — Dave, Oct 09 '12 at 10:41

score 11 · Answer 1 · edited May 23 '17 at 11:52

Using the sample from tinypic (the original scan), with ImageMagick I'd construct an algorithm along the following lines:

Contrast-stretch the original image

Values of 1% for the black-point and 10% for the white-point seem about right.

Command:

```
convert                               \
   http://i46.tinypic.com/21lppac.jpg \
  -contrast-stretch 1%x10%            \
   contrast-stretched.jpg
```

Result: contrast-stretched result

Shave off some border pixels to get rid of the dark scanning artefacts there

A value of 30 pixels on each edge seems about right.

Command:
```
convert                   \
   contrast-stretched.jpg \
  -shave 30x30            \
   shaved.jpg   
```
Result:

De-speckle the image

This operator takes no parameters. Repeating it three times gives better results.

Command:

```
convert       \
   shaved.jpg \
  -despeckle  \
  -despeckle  \
  -despeckle  \
   despeckled.jpg
```

Result: despeckled image

Apply a threshold to make all pixels either black or white

A value of roughly 50% seems about right.

Command:
```
convert           \
   despeckled.jpg \
  -threshold 50%  \
   b+w.jpg
```
Result:

Re-add the shaved-off pixels

Running `identify -format '%Wx%H' 21lppac.jpg` established that the original image measures 1536x835 pixels.

Command:
```
convert            \
   b+w.jpg         \
  -gravity center  \
  -extent 1536x835 \
   big-b+w.jpg
```
Result: (Note: this step is optional. Its purpose is to get back to the original image dimensions, which you may want if you go on from here to overlay the result on the original, or whatever...)
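
When many scans of different sizes are processed, the hard-coded `1536x835` could be read from each original instead. The following is only a sketch under assumptions: the function name `restore_canvas` and its argument order are made up for illustration, and it uses the `%w`/`%h` escapes (pixel width and height) rather than `%W`/`%H`.

```shell
#!/bin/sh
# Sketch: pad a processed page back to the size of its original scan.
# The function name and argument order are illustrative assumptions.
restore_canvas() {
    orig=$1        # untouched scan, used only to read its size
    processed=$2   # shaved / thresholded version of the same page
    out=$3

    # %w and %h are the pixel width and height of the image.
    dims=$(identify -format '%wx%h' "$orig") || return 1

    # Center the processed page on a white canvas of the original size.
    convert "$processed" \
        -background white -gravity center \
        -extent "$dims" \
        "$out"
}

# Example call (file names taken from the steps above):
#   restore_canvas 21lppac.jpg b+w.jpg big-b+w.jpg
```
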
De-Skew the image

A threshold of 40% (the value the ImageMagick documentation suggests for most images) seems to work here too.

Command:
```
convert        \
   big-b+w.jpg \
  -deskew 40%  \
   deskewed.jpg
```
Result:

Remove from each edge all rows and columns of pixels which are purely white

This can be achieved by simply using the -trim operator.

Command:
```
convert         \
   deskewed.jpg \
  -trim         \
   trimmmed.jpg
```
Result:

As you can see, the result is not yet perfect:

there remain some random artefacts on the bottom edge of the image, and
the final trimming didn't remove all white-space from the edges because of other minimal artifacts;
also, I didn't (yet) attempt to apply a distortion correction to the image in order to fix (some of) the distortion. (You can get an idea about what it could achieve by looking at this answer to "Understanding Perspective Projection Distortion ImageMagick".)
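
One possible way around the leftover specks, sketched here as an assumption rather than as part of the tested steps above, is to compute the text bounding box on a blurred black-and-white copy and then crop the original scan with it. The `%@` escape reports the box that `-trim` would keep without trimming anything. The function names `pad_bbox` and `crop_to_text`, the blur radius, the 90% second threshold and the default margin are all guesses that would need tuning, and the page is assumed to be reasonably straight already.

```shell
#!/bin/sh
# Sketch: crop the ORIGINAL scan to the bounding box of its text.
# Blur radius, thresholds, border width and margin are guesses to tune.

# Grow a WxH+X+Y box by pad pixels per side, keeping X and Y >= 0.
pad_bbox() {
    geom=$1; pad=$2
    w=${geom%%x*};  rest=${geom#*x}
    h=${rest%%+*};  rest=${rest#*+}
    x=${rest%%+*};  y=${rest#*+}
    nx=$((x - pad)); if [ "$nx" -lt 0 ]; then nx=0; fi
    ny=$((y - pad)); if [ "$ny" -lt 0 ]; then ny=0; fi
    printf '%sx%s+%s+%s\n' \
        "$((x + w + pad - nx))" "$((y + h + pad - ny))" "$nx" "$ny"
}

crop_to_text() {
    src=$1; dst=$2; margin=${3:-20}

    # Work on a throwaway copy: make it black and white, replace the outer
    # 30 pixels with white (same image size, so coordinates still match the
    # original), then blur and threshold again so that small isolated
    # specks should fade to white while dense text stays black.
    bbox=$(convert "$src" \
        -contrast-stretch 1%x10% -threshold 50% \
        -shave 30x30 -bordercolor white -border 30x30 +repage \
        -blur 0x5 -threshold 90% \
        -format '%@' info:) || return 1

    # Cut that box (plus a margin) out of the untouched original.
    convert "$src" -crop "$(pad_bbox "$bbox" "$margin")" +repage "$dst"
}

# Example call (hypothetical file names):
#   crop_to_text page001.jpg page001-cropped.png 20
```

Because the crop is taken from the untouched scan, contrast and cleanup can still be applied afterwards in whatever tool is preferred.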

Of course, you can easily achieve even better results by playing with a few of the parameters used in each step.

And of course, you can easily automate this process by putting each command into a shell or batch script.
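
The steps above could be chained into one script along these lines. This is a minimal sketch, not a tested production script: the function names `out_path`, `clean_page` and `clean_all`, the directory names and the choice of PNG output are assumptions for illustration, and the optional `-extent` step is left out. On ImageMagick 7 the `convert` command is spelled `magick`.

```shell
#!/bin/sh
# Sketch: run the cleanup steps from this answer over a whole directory.
# Directory names and the PNG output format are assumptions.

# Map an input path such as scans/page001.jpg to <out_dir>/page001.png.
out_path() {
    base=${1##*/}                      # strip the directory part
    printf '%s/%s.png\n' "$2" "${base%.*}"
}

# Apply the steps to one page in a single convert call, so no lossy
# intermediate files are written. +repage drops the canvas offset that
# -trim leaves behind.
clean_page() {
    convert "$1" \
        -contrast-stretch 1%x10% \
        -shave 30x30 \
        -despeckle -despeckle -despeckle \
        -threshold 50% \
        -deskew 40% \
        -trim +repage \
        "$2"
}

# Process every *.jpg in in_dir and write the results to out_dir.
clean_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$in_dir"/*.jpg; do
        [ -e "$f" ] || continue        # glob matched nothing
        clean_page "$f" "$(out_path "$f" "$out_dir")" ||
            printf 'failed: %s\n' "$f" >&2
    done
}

# Example call (hypothetical directory names):
#   clean_all scans cleaned
```

Called as `clean_all scans cleaned`, this would write `cleaned/page001.png` for `scans/page001.jpg`. Values such as the 30-pixel shave or the 50% threshold were chosen for one sample page, so they are worth checking against a handful of pages before a full run.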

Update

Ok, so here is a distortion to roughly rectify the deformation.

Command:

```
convert                                                                         \
   trimmmed.jpg                                                                 \
  -distort perspective '0,0 0,0  1300,0 1300,0  0,720 0,720  1300,720 1300,770' \
   distort.jpg
```

Result (once more with the original underneath, to make direct visual comparison easier): un-distorted image, original image

There is still some barrel-like distortion in the image, which can probably be removed by applying `-distort BarrelInverse` -- we'd just need to find the fitting parameters.

I'm not certain, but I suspect that the distortion may be a little different from the top to bottom of the page (left to right in the images above). There could even be a combination of optical distortion (nonlinear with respect to radius from image center) and a sort of cylindrical or conical distortion from the page being curled. The text looks wider on the left side of the image than on the right side of the image. Can Image Magick perform an affine remapping or do a "quadrilateral warp" to remap a skewed quadrilateral to a rectangle? — Rethunk, Oct 12 '12 at 02:33
Thanks for the answer. How do I apply these to multiple images, given that different images will have different arrangements and positioning? Is there a way to automate recognition of the image pattern in ImageMagick or any other program out there? That is actually my requirement. The answer you provided is really helpful, as I am exploring ImageMagick these days, but I am going to have to ask for more help from you: a solution that applies to multiple images. — Dave, Oct 13 '12 at 03:25
@Dave: The only step where I really adapted the parameters individually to the input file is the last one (`-distort perspective ...`); all the others you should be able to apply to multiple images. And about automation: it's easy to write all these commands, one after the other, into a script, and there you have your automation... — Kurt Pfeifle, Oct 13 '12 at 07:52
@Dave: So you say the answer was 'really helpful', but it still didn't deserve your upvote? — Kurt Pfeifle, Oct 13 '12 at 07:55
@KurtPfeifle I am sorry. I am not that familiar with the features and culture of Stack Overflow. You deserved more than an upvote. Now back to the point: I am still clueless about the solution you suggested. I mean, it works fine, but the problem is that I am working with a large number of images, and applying all those command-line steps to every single image isn't exactly the solution I am looking for. Is there any other way, so that I can apply this filter to multiple images (by multiple I mean over 10,000 of them)? I am not sure if I am presenting this problem clearly. — Dave, Nov 04 '12 at 04:51
@Dave: all of the individual commands I used (with the exception of the last one to partially un-distort the perspective distortion, which was individually constructed) can be put into a shell script. This way you can process thousands and even millions of images automatically... — Kurt Pfeifle, Nov 04 '12 at 11:05
Why use JPEG as an intermediate (lossy) format and lose quality with each iteration, instead of using a lossless format such as PNG? — Mladen B., Sep 13 '14 at 02:46
@MladenB.: Good point! In theory you're right. In practice, it has to be tested whether it makes a real difference. But honestly, I hadn't thought of it when quickly writing this answer almost two years ago. It's a fact that after a few iterations of JPEG-based conversions with ImageMagick, the quality will converge to a stable level. So again: test, to see if your theory is true. (I guess it ***IS***, though the difference will not be too visible...) Thanks for the hint. — Kurt Pfeifle, Sep 14 '14 at 09:41

score 2 · Answer 2 · edited May 23 '17 at 12:33

One technique to segment text from the background is the Stroke Width Transform. You'll find several posts here on Stack Overflow about it, including this one:

Stroke Width Transform (SWT) implementation (Java, C#...)

If the text shown in the Wikipedia page is representative of written Oriya, then I'm confident that the SWT (or your customized version of it) will perform well. You may still have to do some manual tweaking after you review an image, but an SWT-based method should do a lot of the work for you.

Although the SWT may not identify every single stroke, it should give you a good estimate of the dimensions of the space occupied by strokes (and characters). The simplest method

A newish class of algorithms that might work for you is "content-aware resizing", such as "seam carving," which automatically removes paths of pixels of low information content (e.g. background pixels). Here's a video about seam carving:

http://www.youtube.com/watch?v=qadw0BRKeMk

There's a seam carving plugin ("liquid resizing") for GIMP: http://liquidrescale.wikidot.com/

This blog post reports a plugin for Photoshop: http://wordpress.brainfight.com/195/photoshop-cs5-content-aware-aka-seam-carving-aka-liquid-resize-fun-marketing/

For an overview of OCR techniques, I recommend the book Character Recognition Systems by Cheriet, Kharma, Liu, and Suen. The references in that book could keep you busy for quite some time.

http://www.amazon.com/Character-Recognition-Systems-Students-Practitioners/dp/0471415707

Finally, consider joining the Optical Character Recognition group on LinkedIn to post more specific questions. There are academics, researchers, and engineers in the industry who can answer questions in great detail, and you might also be able to make contact via email with researchers in India who are developing OCR for languages similar to Oriya, though they may not have published the software yet.

Thanks for this resourceful answer, this will help me a lot. I have posted a sample image of the scanned doc. Please check it. Something new might strike your mind. Thanks again. — Dave, Oct 08 '12 at 09:45
I'll take a look and post some more comments later. Thanks for the link to the images. — Rethunk, Oct 08 '12 at 13:12
Performing a morphological "close" operation could make it easier to find a line that would fit the left edge and right edge of text on each page. To follow up Kurt's work on distortion: given the curl of the page, the lines at the left and right edges may not be parallel, so although a rectangular crop could leave just the text, the lines of text may still be curved, and that could present a problem for future Oriya OCR algorithms. — Rethunk, Oct 12 '12 at 02:53

score 2 · Accepted Answer · answered Nov 05 '12 at 16:29

We addressed many "smart cropping" issues in our open-source DjVu->PDF converter. The converter also allows you to load a set of scanned images instead of DjVu (just hold SHIFT while using the Open command) and to output a set of images instead of a PDF.

It is a free cross-platform GUI tool, written in Java.

image converter, smart crop and deskew


zfr


That's a really nice tool and it would be very helpful for me. But the problem is that I don't know much about the DjVu format, and the software available for it isn't very easy to work with. I mean, they are basically command-line tools (correct me if I am wrong). How do I feed JPEG images to this software so that it can take them in, convert them to DjVu, take commands from the user on the layout and margins, and output PDF or JPEG? Your work is phenomenal and I hope you can make this happen. I really need your help here. I am also noticing some loss in image quality. – Dave Nov 27 '12 at 09:52
