Why does extracting a JPG image from a PDF with Wand give me a black background over the text?

Question

I have a problem with some PDF files. I need to convert them to JPG images so that they can be passed to OCR, but for some of them Wand produces a JPG with a black background covering the text. I have read that this is a common problem related to color spaces. It seems to happen with Word files converted to PDF, where the color space becomes CMYK. Tesseract OCR accepts only the RGB color space. I have already written a Python script that does the conversion, but I'd like to solve this problem. Could you help me? Thanks.

[Images: original PDF page; PDF page converted to JPG]

Can you post some pictures of expected results and the actual results? — theblackips, Apr 22 '19 at 10:40
What code are you using that produced this issue? Have you tried using an online conversion site? — rassar, Apr 22 '19 at 21:01
I have posted the code that I use below. I can't use an online conversion site because this is a work project, so I'm trying to solve the problem myself. — Danilo Giovannico, Apr 22 '19 at 22:45
If your original PDF has transparency and you are trying to save to JPG, then it will be black, since JPG does not support transparency. So either save to PNG or flatten your rasterized PDF over a background of white. Can you post a link to your original PDF and not a PNG equivalent? — fmw42, Apr 28 '19 at 19:40
@DaniloGiovannico The code you've posted as an accepted answer is the code you already use that doesn't work, or new code that does? — jtlz2, May 22 '19 at 10:48
@fmw42 So how to do that using Wand as the OP asks?? Or another way..? — jtlz2, May 22 '19 at 10:49
Possible duplicate of [Python Wand converts from PDF to JPG background is incorrect](https://stackoverflow.com/questions/20439234/python-wand-converts-from-pdf-to-jpg-background-is-incorrect) — jtlz2, May 22 '19 at 10:58

score 1 · Answer 1 · edited May 22 '19 at 10:58


The solution is to set these before you call save:

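from wand.color import Color
from wand.image import Image as wi

page = wi(image=img)

# Flatten any transparency onto white before writing the JPG
page.background_color = Color('white')
page.alpha_channel = 'remove'

page.save(...)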

Thanks to this Stack Overflow answer.
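
To see why these two settings matter, here is a minimal pure-Python sketch of the pixel arithmetic. It is an illustration only, not Wand's or ImageMagick's actual implementation. It assumes that transparent areas of the rasterized page are stored as RGBA pixels with zeroed color channels, which is an assumption about the input, and the helper names `drop_alpha` and `flatten_over` are hypothetical. JPEG has no alpha channel, so if the alpha value is simply discarded such a pixel turns black, whereas compositing it over white first keeps the page background white.

```python
def drop_alpha(pixel):
    """Discard the alpha value and keep the stored color channels as-is."""
    r, g, b, _a = pixel
    return (r, g, b)


def flatten_over(pixel, background=(255, 255, 255)):
    """Composite an 8-bit RGBA pixel over an opaque background color.

    Standard "over" blending with integer arithmetic:
    out = (color * alpha + background * (255 - alpha)) / 255, rounded.
    """
    r, g, b, a = pixel
    return tuple(
        (c * a + bg * (255 - a) + 127) // 255
        for c, bg in zip((r, g, b), background)
    )


# Assumed input: a fully transparent page pixel and a fully opaque text pixel.
transparent_page = (0, 0, 0, 0)
opaque_text = (0, 0, 0, 255)

print(drop_alpha(transparent_page))    # alpha discarded: page pixel becomes black
print(flatten_over(transparent_page))  # composited over white: page pixel stays white
print(flatten_over(opaque_text))       # opaque text is unchanged
```

In this sketch the text pixel is black either way, so once the page pixels also become black the text is no longer distinguishable, which matches the symptom described in the question.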

edited May 22 '19 at 10:58

double-beep


answered May 22 '19 at 10:57

jtlz2


score 0 · Accepted Answer · edited May 22 '19 at 10:59

This is my code:

import os

from wand.image import Image as wi

# Assumed placeholder: PATH_IMAGES is defined elsewhere in the original script.
PATH_IMAGES = "images"


def convert_pdf(pdf_file):
    # File name without the extension (title) and with it (basename)
    title = os.path.splitext(os.path.basename(pdf_file))[0]
    basename = os.path.basename(pdf_file)

    # Rasterize the PDF at 100 DPI and convert the pages to JPG
    pdf = wi(filename=pdf_file, resolution=100)
    pdfImage = pdf.convert("jpg")

    # One output directory per PDF
    outputPath = PATH_IMAGES + "/" + basename
    os.makedirs(outputPath, exist_ok=True)

    for i, img in enumerate(pdfImage.sequence, start=1):
        page = wi(image=img)
        imagePathConverted = outputPath + "/" + title + "(*page=" + str(i) + "*)" + ".jpg"
        page.save(filename=imagePathConverted)
        # Disabled attempt to force RGB with PIL:
        # image = Image.open(imagePathConverted)
        # if image.mode != 'RGB':
        #     rgb_image = image.convert('RGB')
        #     rgb_image.save(imagePathConverted)

    return outputPath
