Converting Scanned PDFs to an Image

Question

I'm able to scan a JPG image using Tesseract, and I'm able to read a regular PDF using iTextSharp and get the text from it. But I can't find a way to either get the text from a scanned PDF (a file with a .PDF extension), or convert a PDF to an image so I can then scan it with Tesseract. Are there any options that I'm missing? Thanks!

score 0 · Answer 1 · edited May 23 '17 at 12:01

Assuming that you have scanned the PDF document, and that it contains only text, you can generate an image from text with the following method:

// Requires: using System; using System.Drawing;
private Image DrawText(string text, Font font, Color textColor, Color backColor)
{
    // Measure the string on a throwaway 1x1 bitmap to see how big the image needs to be.
    SizeF textSize;
    using (var dummy = new Bitmap(1, 1))
    using (Graphics measure = Graphics.FromImage(dummy))
    {
        textSize = measure.MeasureString(text, font);
    }

    // Create a new image of the right size. Round up and keep at least one pixel
    // per side so that empty text does not make the Bitmap constructor throw.
    int width = Math.Max(1, (int)Math.Ceiling(textSize.Width));
    int height = Math.Max(1, (int)Math.Ceiling(textSize.Height));
    Image img = new Bitmap(width, height);

    using (Graphics drawing = Graphics.FromImage(img))
    using (Brush textBrush = new SolidBrush(textColor))
    {
        // Paint the background, then draw the text at the top-left corner.
        drawing.Clear(backColor);
        drawing.DrawString(text, font, textBrush, 0, 0);
    }

    // The caller owns the returned image and should dispose it when done.
    return img;
}

Reference: How to generate an image from text on fly at runtime
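
For completeness, here is a minimal, self-contained sketch of how a method like DrawText might be called and its result written to disk, so that the image can then be handed to an OCR tool such as Tesseract. The class name DrawTextDemo, the font ("Arial", 24), the sample string, and the output file name text.png are all assumptions made up for illustration; the sketch also assumes System.Drawing is available (on newer .NET versions that means the System.Drawing.Common package on Windows).

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;

static class DrawTextDemo
{
    // Compact copy of the DrawText idea so this sketch compiles on its own.
    public static Bitmap DrawText(string text, Font font, Color textColor, Color backColor)
    {
        // Measure the text on a throwaway 1x1 bitmap.
        SizeF size;
        using (var probe = new Bitmap(1, 1))
        using (var g = Graphics.FromImage(probe))
        {
            size = g.MeasureString(text, font);
        }

        // Allocate the real bitmap (at least 1x1) and draw into it.
        var bmp = new Bitmap(Math.Max(1, (int)Math.Ceiling(size.Width)),
                             Math.Max(1, (int)Math.Ceiling(size.Height)));
        using (var g = Graphics.FromImage(bmp))
        using (var brush = new SolidBrush(textColor))
        {
            g.Clear(backColor);
            g.DrawString(text, font, brush, 0, 0);
        }
        return bmp;
    }

    static void Main()
    {
        // Hypothetical font, text, and file name, chosen only for this example.
        using (var font = new Font("Arial", 24))
        using (var img = DrawText("Hello, OCR", font, Color.Black, Color.White))
        {
            img.Save("text.png", ImageFormat.Png);
            Console.WriteLine($"Saved text.png ({img.Width}x{img.Height})");
        }
    }
}
```

Note that this only turns text you already have into an image; it does not rasterize the pages of a PDF.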
