
My code is below. I've tried it on other PDFs and it was able to extract the text accurately.

import PyPDF2

pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

Specifically, when I run the above code there is no output. The provider of the PDF tries to sell the data in the PDF, so it makes sense why they don't want it to be easily scraped. I'm just wondering what the best workaround is, because I don't have 100k lying around.

If it helps, it looks like the PDF was produced with PDFsharp (pdfsharp.net). When I upload the PDF in Google Colab and assign it to a variable, a portion of the result of printing that variable is below.

{'test.pdf': b'%PDF-1.4\n%\xd3\xf4\xcc\xe1\n1 0 
obj\n<<\n/CreationDate(D:20190310110705-04\'00\')\n/Title(Efficiency Summary  
Player Name)\n/Creator(PDFsharp 1.32.2608-w \\(www.pdfsharp.net\\))\n/Producer(PDFsharp 1.32.2608-w \\(www.pdfsharp.net\\))\n>>\nendobj\n2 0 obj\n<<\n/Type/Catalog\n/Pages 3 0 R\n>>\nendobj\n3 0 obj\n<<\n/Type/Pages\n/Count 1\n/Kids[4 0 R]\n>>\nendobj\n4 0 obj\n<<\n/Type/Page\n/MediaBox[0 0 612 792]\n/Parent 3 0 R\n/Contents 5 0 R\n/Resources\n<<\n/ProcSet [/PDF/Text/ImageB/ImageC/ImageI]\n/XObject\n<<\n/I0 8 0 R\n>>\n>>\n/Group\n<<\n/CS/DeviceRGB\n/S/Transparency\n/I false\n/K false\n>>\n>>\nendobj\n5 0 obj\n<<\n/Length 62\n/Filter/FlateDecode\n>>\nstream\nx\x9c+\xe42T0\x00B]\x10eni\xa4\x90\x9c\x0bd\x1b\x18(\x84Tq\x15r\x15*\x98\x9a\x1aA\xe4\xcd\xcd\xcc\x14\x8c\x8d\x14\xcc\xcd\xcd@J\xf4=\r\x14\\\xf2\x15\x02\xb9@\x10\x00\xd8\xf3\r\xe0\nendstream\nendobj\n6 0 obj\n<<\n/Type/XObject\n/Subtype/Image\n/Length 159\n/Filter/FlateDecode\n/Width 900\n/Height 1250\n/BitsPerComponent 1\n/ImageMask true\n>>\nstream\nx\x9c\xed\xc11\x01\x00\x00\x00\xc2 \xfb\xa76\xc6\x1e`\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\xe8\'\xe0\x00\x01\nendstream\nendobj\n7 0 obj\n<<\n/Type/XObject\n/Subtype/Image\n/Length 6413\n/Filter/FlateDecode\n/Width 900\n/Height 1250\n/BitsPerComponent 8\n/ColorSpace/DeviceGray\n>>\nstream\nx\x9c\xed\xdd\x81z\xa2\xbc\x16\x05\xd0\xf7\x7f\xe9\xe4\xde\xbf\x85\xe4\x9c$X\xdb\xb1\x15t\xado\xa6U\x0c!\x02\xdb@\xb4R\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
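
A minimal sketch using the same PyPDF2 1.x API as above can confirm what the dump already suggests: the page's /Resources contain only /ProcSet and /XObject entries and no /Font, so there is no text for extractText() to return.

import PyPDF2

reader = PyPDF2.PdfFileReader(open('test.pdf', 'rb'))
resources = reader.getPage(0)['/Resources'].getObject()

print('/Font' in resources)                   # False here: no text objects on the page
xobjects = resources.get('/XObject')
if xobjects is not None:
    print(list(xobjects.getObject().keys()))  # e.g. ['/I0'], the embedded image(s)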
  • [john233035](https://stackoverflow.com/users/10157758/john233035) stated "The provider of the PDF tries to sell the data in the PDF, so it makes sense why they don't want it to be easily scraped." Just to be clear, you're asking for help to write code that helps steal intellectual property?! – Trenton McKinney Aug 16 '19 at 17:37
  • I've rolled back your edit. Once you've posted here, the content belongs to this site, and you can't deface it by replacing the text with noise. Please spend some time reading the [help], especially the terms of use of this site. – Ken White Dec 10 '19 at 04:06

2 Answers


This code might be useful to you; I used it for a previous project where I scraped data from a PDF. I'm not sure if you've tried using pytesseract. The code converts the PDF into page images, runs OCR on them, and returns the recognized text while also writing it to a text file. You can modify the for page in pages loop to extract specific pages.

from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import os

def OCR(pdf):
    pdfName = pdf.split('.pdf')[0]
    # Render every page of the PDF to an image (500 DPI)
    pages = convert_from_path(pdf, 500)
    image_counter = 1
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(pdfName + filename, 'JPEG')
        image_counter += 1
    filelimit = image_counter - 1
    f = open(pdfName + ".txt", "wb")
    text = ''
    # OCR each page image, then delete the temporary file
    for i in range(1, filelimit + 1):
        filename = pdfName + "page_" + str(i) + ".jpg"
        text += str(pytesseract.image_to_string(Image.open(filename)))
        text = text.replace('-\n', '')   # re-join words hyphenated across line breaks
        text = text.replace('\n', ' \n')
        os.remove(filename)
    f.write(text.encode('utf-8', 'replace'))
    f.close()
    return text
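
For example (a usage sketch, assuming poppler and the tesseract binary are installed, and using the file name from the question):

text = OCR('test.pdf')   # also writes test.txt; the temporary page JPEGs are removed as it goes
print(text[:200])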
Daniel Glynn

You're just seeing the raw bytes of the PDF file there. The fact that they've put the "Info dict" at the top of the file, and hence that you see strings like /Creator, isn't guaranteed; it's just because it's a "linearised" file.

Doing something like Daniel suggested is the way to go, but his implementation might introduce additional artifacts. Tesseract is OCR software that attempts to turn rasterized text back into characters, so it might do better working directly with the images in the PDF file rather than rasterizing the whole page to an image. Also, encoding to JPEG seems awkward; using a lossless format like PNG is probably going to do slightly better.

Generally I'd still recommend using something like pytesseract, but feed it something else; see here for getting at the images directly.
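
A rough sketch of that route (not code from either answer): it assumes the 8-bit /DeviceGray, /FlateDecode image visible in the question's dump and sticks with the PyPDF2 1.x API from the question, pulling the image XObjects off the page and handing them to pytesseract without re-rasterizing anything.

import PyPDF2
import pytesseract
from PIL import Image

reader = PyPDF2.PdfFileReader(open('test.pdf', 'rb'))
page = reader.getPage(0)
xObjects = page['/Resources']['/XObject'].getObject()

for name in xObjects:
    img_obj = xObjects[name].getObject()
    if img_obj['/Subtype'] != '/Image':
        continue
    # getData() applies the stream filters (here /FlateDecode), leaving raw samples
    data = img_obj.getData()
    size = (img_obj['/Width'], img_obj['/Height'])
    # The dump shows an 8-bit /DeviceGray image, which maps to PIL mode 'L';
    # other colour spaces / bit depths would need their own handling.
    if img_obj.get('/ColorSpace') == '/DeviceGray' and img_obj.get('/BitsPerComponent') == 8:
        img = Image.frombytes('L', size, data)
        print(name, pytesseract.image_to_string(img))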

Sam Mason