How to deskew a scanned text page with ImageMagick?

Question

I have scanned documents that weren't scanned perfectly straight, so the text is not perfectly horizontal: each line slopes by perhaps 10°.

My understanding is that the -deskew option in ImageMagick should solve this, for example:

convert skewed_1500.jpeg -deskew 40% skewed_1500_not.jpg

but it doesn't have any noticeable effect on the output file.

I've attached the skewed and deskewed images for comparison.

First the original image:

Then the purportedly deskewed image:

score 13 · Accepted Answer · answered Jan 09 '17 at 10:55

I would try a bigger value, like 80%. Otherwise, an ImageMagick forum member has a bash script that may work better: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php

Bonzo

Excellent, your 80% suggestion did the job perfectly. I also tried the script that you linked to, and the bare script, without playing with parameters, did deskew somewhat but not as perfectly as your 80% suggestion. Many thanks, this one has gone into the toolbox. – carbontracking Jan 24 '17 at 14:51
What exactly does the percentage mean? I can understand an angle, but a percentage makes no sense to me. Also, no matter how high I set the value, convert doesn't do anything for me. – polemon Mar 31 '19 at 04:11
I tested with a form flipped 90 degrees clockwise and the textdeskew script was not able to perform the proper orientation. The page was flipped down, as the documentation warns. – eduardosufan Jul 27 '23 at 18:50

Matthias Braun · Answer 2 · 2021-07-01T15:27:15.333

with OCRmyPDF

You can also straighten the pages by first having ImageMagick convert your JPG to PDF (convert input.jpg input.pdf) and then letting OCRmyPDF deskew the PDF:

ocrmypdf --deskew --tesseract-timeout=0 input.pdf output.pdf

Using your example page, I'd say the resulting text is straight:

straightened page, after running OCRmyPDF

As documented here, --tesseract-timeout=0 disables optical character recognition.

Of course you can also deskew the PDF and make it searchable in one go:

ocrmypdf --deskew -l fra input.pdf output.pdf

Make sure to have the French language pack from Tesseract installed before running this. Here are instructions.
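If you want to script the two steps above, here is a minimal Python sketch. It only assembles the same convert and ocrmypdf command lines shown in this answer and runs them in order. The function names straighten_commands and straighten are made up for illustration, and the sketch assumes both tools are on your PATH.

```python
import subprocess


def straighten_commands(jpg, pdf_in, pdf_out, language=None):
    """Return the two command lines: JPG -> PDF, then OCRmyPDF deskew."""
    to_pdf = ["convert", jpg, pdf_in]
    ocr = ["ocrmypdf", "--deskew"]
    if language is None:
        # No language given: deskew only, with OCR disabled via the
        # --tesseract-timeout=0 option used earlier in this answer.
        ocr.append("--tesseract-timeout=0")
    else:
        # Language given: deskew and make the PDF searchable in one go.
        ocr += ["-l", language]
    ocr += [pdf_in, pdf_out]
    return [to_pdf, ocr]


def straighten(jpg, pdf_in="input.pdf", pdf_out="output.pdf", language=None):
    """Run both steps; check=True stops at the first failing command."""
    for cmd in straighten_commands(jpg, pdf_in, pdf_out, language):
        subprocess.run(cmd, check=True)
```

Calling straighten("input.jpg") would deskew without OCR, while straighten("input.jpg", language="fra") corresponds to the searchable French variant and needs the language pack mentioned above.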

Crop the PDF

To get rid of the black parts on the sides and the white part on the bottom of the PDF, you can use pdfcrop (commonly part of TeX Live):

# Remove margins at left, top, right, and bottom
pdfcrop --margins '-60 0 -50 -430' output.pdf cropped_output.pdf

The cropped and deskewed PDF:

PDF cropped with pdfcrop

score 0 · Answer 3 · answered Feb 24 '22 at 15:55

This doesn't use ImageMagick, but it does the same job of deskewing the scanned document/image.

The following code deskews the image:

import numpy as np
from skimage import io
from skimage.transform import rotate
from skimage.color import rgb2gray
from deskew import determine_skew
from matplotlib import pyplot as plt

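def deskew(_img):
    """Read the image at path _img and return a deskewed uint8 copy."""
    image = io.imread(_img)
    grayscale = rgb2gray(image)
    # determine_skew returns the estimated skew angle in degrees,
    # or None when no angle could be detected.
    angle = determine_skew(grayscale)
    if angle is None:
        return image
    # rotate() returns floats in [0, 1], so scale back to 0-255.
    rotated = rotate(image, angle, resize=True) * 255
    return rotated.astype(np.uint8)

def display_before_after(_original):
    """Show the original and the deskewed image side by side."""
    plt.subplot(1, 2, 1)
    plt.imshow(io.imread(_original))
    plt.subplot(1, 2, 2)
    plt.imshow(deskew(_original))
    plt.show()  # needed outside notebooks to open the figure window

display_before_after('img_35h.jpg')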

Reference and Source: http://aishelf.org/deskew/

score 0 · Answer 4 · answered Feb 24 '22 at 16:21

You have the right syntax in ImageMagick; just increase the percentage to 60%.

Input:

convert skewed_1500.jpeg -deskew 60% x.jpg
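
Since different thresholds come up on this page (40% in the question, 60% here, 80% in the accepted answer), it may be easiest to try several and compare the results by eye. The following is a minimal Python sketch that runs the same convert command once per threshold. The function names and the output file naming are made up for illustration, and it assumes convert is on your PATH.

```python
import subprocess
from pathlib import Path


def deskew_command(src, threshold_percent, out_dir="."):
    """Build the `convert SRC -deskew N% OUT` argument list for one threshold."""
    src = Path(src)
    # Hypothetical naming scheme: one output file per threshold tried.
    out = Path(out_dir) / f"{src.stem}_deskew{threshold_percent}.jpg"
    return ["convert", str(src), "-deskew", f"{threshold_percent}%", str(out)]


def sweep_thresholds(src, thresholds=(40, 60, 80), out_dir=".", run=True):
    """Build one command per threshold, optionally run them, and return them."""
    commands = [deskew_command(src, t, out_dir) for t in thresholds]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if convert fails
    return commands
```

Calling sweep_thresholds("skewed_1500.jpeg") would write one output file per threshold into the current directory; pass run=False to only see the commands it would run.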

fmw42
