2

I have a problem with some PDF files. I need to convert them to JPG images so they can be run through OCR, but for some of them Wand produces JPGs with a black background over the text. This seems to be a common problem related to color spaces: it appears to happen with Word files that were converted to PDF, where the color space became CMYK. Tesseract OCR accepts only the RGB color space. I have already written a Python script that does the conversion, but I'd like to solve this problem. Could you help me? Thanks.

[Image: original PDF page]

[Image: page converted from PDF to JPG]

  • Can you post some pictures of expected results and the actual results? – theblackips Apr 22 '19 at 10:40
  • Yes, sure. I've edited the post to add the photos. – Danilo Giovannico Apr 22 '19 at 20:50
  • What code are you using that produced this issue? Have you tried using an online conversion site? – rassar Apr 22 '19 at 21:01
  • I've posted the code I use below. I can't use an online conversion site because this is a work project, so I'm trying to solve the problem myself. – Danilo Giovannico Apr 22 '19 at 22:45
  • 1
    If your original PDF has transparency and you are trying to save to JPG, then it will be black, since JPG does not support transparency. So either save to PNG or flatten your rasterized PDF over a background of white. Can you post a link to your original PDF and not a PNG equivalent? – fmw42 Apr 28 '19 at 19:40
  • @DaniloGiovannico The code you've posted as an accepted answer is the code you already use that doesn't work, or new code that does? – jtlz2 May 22 '19 at 10:48
  • @fmw42 So how to do that using Wand as the OP asks?? Or another way..? – jtlz2 May 22 '19 at 10:49
  • Possible duplicate of [Python Wand converts from PDF to JPG background is incorrect](https://stackoverflow.com/questions/20439234/python-wand-converts-from-pdf-to-jpg-background-is-incorrect) – jtlz2 May 22 '19 at 10:58

2 Answers

1

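JPG has no alpha channel, so transparent areas of the rasterized PDF come out black unless they are flattened onto a white background first. The solution is to set these before you call save:

from wand.color import Color

page = wi(image=img)

page.background_color = Color('white')
page.alpha_channel = 'remove'

page.save(...)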

Thanks to this Stack Overflow answer.
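
For context, here is a minimal sketch of how those two lines might fit into a full page-by-page conversion. The function names `page_jpg_path` and `convert_pdf_to_jpgs`, the `output_dir` parameter, and the `-page-N` file naming are illustrative assumptions rather than part of the original script. It also assumes Wand 0.5 or later, with ImageMagick and Ghostscript installed so that PDFs can be read.

```python
import os


def page_jpg_path(output_dir, title, page_number):
    """Build the output path for one page (hypothetical naming scheme)."""
    return os.path.join(output_dir, "{}-page-{}.jpg".format(title, page_number))


def convert_pdf_to_jpgs(pdf_file, output_dir, resolution=100):
    """Rasterize each page of pdf_file to a JPG flattened onto white.

    Returns the list of file paths that were written.
    """
    # Imported here so the path helper above works without Wand installed.
    from wand.color import Color
    from wand.image import Image

    title = os.path.splitext(os.path.basename(pdf_file))[0]
    os.makedirs(output_dir, exist_ok=True)

    written = []
    with Image(filename=pdf_file, resolution=resolution) as pdf:
        for number, frame in enumerate(pdf.sequence, start=1):
            with Image(image=frame) as page:
                # JPG cannot store transparency, so replace it with white
                # before saving (the fix described in this answer).
                page.background_color = Color("white")
                page.alpha_channel = "remove"
                page.format = "jpeg"
                target = page_jpg_path(output_dir, title, number)
                page.save(filename=target)
                written.append(target)
    return written
```

Calling `convert_pdf_to_jpgs("report.pdf", "images/report")` would then write one JPG per page into `images/report`. If the black background persists, saving to PNG instead, as suggested in the comments on the question, is another option to try.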

double-beep
jtlz2
0

This is my code:

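import os

from wand.image import Image as wi

# PATH_IMAGES (the root output directory) is assumed to be defined elsewhere.


def convert_pdf(pdf_file):
    # File name without extension, used to name each page image
    title = os.path.splitext(os.path.basename(pdf_file))[0]
    basename = os.path.basename(pdf_file)
    pdf = wi(filename=pdf_file, resolution=100)
    pdfImage = pdf.convert("jpg")

    # One output folder per PDF, named after the PDF file
    outputPath = PATH_IMAGES + "/" + basename
    if not os.path.exists(outputPath):
        os.mkdir(outputPath)

    # Save every page as its own JPG, numbering pages from 1
    for i, img in enumerate(pdfImage.sequence, start=1):
        page = wi(image=img)
        imagePathConverted = outputPath + "/" + title + "(*page=" + str(i) + "*)" + ".jpg"
        page.save(filename=imagePathConverted)
        # Disabled attempt to force RGB with Pillow after saving:
        # image = Image.open(imagePathConverted)
        # if image.mode != 'RGB':
        #     rgb_image = image.convert('RGB')
        #     rgb_image.save(imagePathConverted)

    return outputPath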
double-beep