extract dimensions from PDF using OCR

Question

I am looking for a way to programmatically examine a PDF CAD drawing (a plain 2D print) and pull out all the dimensions, along with the location of each dimension on the page. Which technologies would allow me to do this?

I'm comparing LEADTOOLS, PDFBox, iText, TET, and the Adobe SDK. I am particularly interested in recognizing dimensions/numbers and shapes accurately, and the API must be able to extract location information as well. Any past experience with these, or insight into which ones work well or poorly, would be greatly appreciated!

score 0 · Answer 1 · answered Apr 13 '17 at 18:16

We can provide relevant information about the LEADTOOLS part of your question since it's our product.

If the PDF contains actual text and not just an image of text, you can extract it directly without going through OCR. To do that, use the Leadtools.Pdf.PDFDocument.ParsePages() method.
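Whichever library does the extraction, the result is typically a list of text fragments with bounding boxes, and the dimensions still have to be picked out of it. Here is a minimal Python sketch of that filtering step. The `TextItem` record, the `find_dimensions` helper, and the pattern are all hypothetical illustrations, not part of the LEADTOOLS API, and the pattern only covers simple cases such as `12.5`, `R5`, or `Ø10 ±0.1 mm`.

```python
import re
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical record: one extracted text fragment and its bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float


# Assumed pattern for simple dimension callouts; real drawings will need more cases.
DIMENSION_RE = re.compile(
    r"""^\s*
    (?:[ØøR⌀]\s*)?                          # optional diameter or radius prefix
    \d+(?:[.,]\d+)?                         # nominal value, e.g. 12 or 12.5
    (?:\s*(?:±|\+/-)\s*\d+(?:[.,]\d+)?)?    # optional symmetric tolerance
    \s*(?:mm|cm|m|in|"|')?                  # optional unit
    \s*$""",
    re.VERBOSE,
)


def find_dimensions(items):
    """Return only the items whose text looks like a dimension."""
    return [item for item in items if DIMENSION_RE.match(item.text)]


if __name__ == "__main__":
    sample = [
        TextItem("12.5", 100, 200, 30, 8),
        TextItem("SECTION A-A", 50, 700, 90, 10),
        TextItem("Ø10 ±0.1 mm", 300, 150, 70, 8),
    ]
    for item in find_dimensions(sample):
        print(item.text, (item.x, item.y, item.width, item.height))
```

Each match keeps its bounding box, so the location of every dimension comes along with its value.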

If you’re dealing with images that contain both text and non-text areas, you could use Leadtools.ImageProcessing.Core.AutoZoningCommand to isolate the text zones (areas) and get their coordinates. You could then run either our OCR engine or your own code on those zones to recognize the text. If you try this and don’t get satisfactory results, there may be other advanced options that help, but we would need to see the actual samples you’re working with. If you like, email some sample files to our support address and mention what you have tried so far.
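Zone coordinates found on a rasterized page are usually in pixels, so they may need converting before they can be compared with positions in the PDF itself. The Python sketch below assumes the zone is given in pixels with a top-left origin, the raster resolution is known, and the page uses the default PDF user space (72 units per inch, bottom-left origin, no rotation or crop offset). The function name and parameters are hypothetical.

```python
def pixel_rect_to_pdf_points(left, top, width, height, dpi, page_height_pt):
    """Convert a pixel rectangle (top-left origin) to PDF points (bottom-left origin).

    Assumes an unrotated page in default PDF user space: 72 points per inch.
    Returns (x, y, width, height), where (x, y) is the lower-left corner.
    """
    x = left * 72.0 / dpi
    w = width * 72.0 / dpi
    h = height * 72.0 / dpi
    # Flip the vertical axis: the bottom edge of the pixel rectangle becomes y.
    y = page_height_pt - (top + height) * 72.0 / dpi
    return (x, y, w, h)


if __name__ == "__main__":
    # A zone found on a US Letter page (792 pt tall) rendered at 300 DPI.
    print(pixel_rect_to_pdf_points(300, 300, 600, 150, dpi=300, page_height_pt=792))
```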
