4

Are there currently any services or software tools that use Google Cloud Vision as backend for OCRing scanned PDF files?

If not, how would one be able to use Google Cloud Vision to turn PDFs into OCRed PDFs? As far as I know, Cloud Vision currently supports PDF files, but it will output recognized text only as a JSON file. So it seems one would need to do the additional step of placing this converted text on top of the image inside the PDF outside of Google Cloud Vision, in a separate step.

Background:

I often have to convert scanned-document PDF files into PDF files containing an OCRed text layer. So far, I've been using Software like OCRKit or ABBYY FineReader. I tested the accuracy of these solutions against the text recognition abilities of Google Cloud Vision, and the latter came out far ahead.

WillToPower
  • 49
  • 1
  • 3
  • The OCR.space freemium [OCR API](https://ocr.space/ocrapi) supports PDF input and creates [searchable PDF](https://ocr.space/searchablepdf) out of them. The ocr quality is very good, albeit not as good as google cloud vision. But it's free. – Fabrice Zaks Sep 15 '18 at 10:57

3 Answers3

2

As others have mentioned, you need to use third party tools to do this.

First convert the google cloud vision response json to a hocr file using gcv2hocr:

gcv2hocr test.jpg.json output.hocr

Then use hocr-tools to stitch the hocr data to the pdf file. The below command will look in the 'imgdir' folder and merge .hocr and .jpg with the same name into pages in out.pdf.

hocr-pdf --savefile out.pdf <imgdir>
FBB
  • 326
  • 1
  • 12
0

As you well mentioned, the responses retrieved by Vision API are available only on a JSON format; therefore, it is required to include an additional step within your solution, by using third-party libraries, in order to create a PDF file based on the response's content.

In case this feature doesn't cover your current needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service public documentation, as well as take a look the Issue Tracker tool in order to raise a Vision API feature request and notify to Google about this desired functionality.

Armin_SC
  • 2,130
  • 11
  • 15
  • [Here](https://issuetracker.google.com/issues/120628290) is the feature request. Please vote if you want google to add it. – Joakim Apr 27 '19 at 13:58
0

Solution for starting with a PDF and using Vision's document text detection:

gcv2hocr works for a very specific vision json format not the output from document text detection. I refactored that code to create the correct hocr.

Second issue is that hocr-pdf takes a jpeg image not a pdf to start with. I refactored hocr-pdf and included pdf2image.

image=convert_from_path(imageFilePath)[0]
image_obj=io.BytesIO()
image.save(image_obj,format='jpeg')
image_obj.seek(0)
image_obj=ImageReader(image_obj)
can=Canvas(savefile,pagesize=letter)
width,height=image.size
width *=  (72 / 200)
height *= (72 / 200)
can.setPageSize((width, height))
can.drawImage(image_obj, 0, 0, width=width, height=height)
load_invisible_font()
add_text_layer(can,height)
can.showPage()
can.save()

This is better than a pypdf2 solution because it deletes any existing hidden text layers first by converting it to an image.

manbearpig
  • 143
  • 1
  • 7