extract text and labels from PDF document

Question

I am trying to detect and extract the "labels" and "dimensions" of a 2D technical drawing that is saved as a PDF, using Python. I came across a Python library called "pytesseract", which has optical character recognition (OCR) capability. I tried the demo on my image, but it fails to detect most of the labels and dimensions. Please suggest another way to do this. Thank you**.

** Attached is a sample of the 2D technical drawing I am trying to process.

** What I am trying to achieve is to obtain the coordinates of every dimension (the 160, 120, 10, 4x45, etc.) on the image, and to extract them as well.

https://stackoverflow.com/a/45791457/7919597 – Joe, Mar 07 '20 at 10:29

score 0 · Answer 1 · answered May 08 '20 at 22:16

About 16 months ago we asked ourselves the same question. If you want to implement it yourself, I'd suggest the following process:

1. Extract the Canvas from the sheet
2. Separate the Cuts
3. Detect the Measure Regions on each Cut
4. Detect the individual attributes of the Measure Regions to determine where each Measure starts and ends. In your particular example that's relatively easy.
5. Run the detected Measure Labels through OCR
6. Associate the Labels to the Measures
7. Verify your results
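
To make the OCR and association steps more concrete, here is a minimal sketch in Python. It is not how our API works internally, just one plausible way to start. It assumes you already have a rendered image of the drawing and a list of measure lines as pixel endpoints from the earlier detection steps (that list, the helper names, and the dimension regex are all hypothetical). pytesseract's image_to_data is a real call that returns word boxes, and "--psm 11" is Tesseract's sparse-text mode, which may or may not suit your drawing.

```python
import math
import re

# Dimension-like tokens such as "160", "12.5", "4x45" or "R10".
# This pattern is an illustrative assumption, not a complete grammar.
DIMENSION_RE = re.compile(r"^[RØ]?\d+(?:[.,]\d+)?(?:\s*[xX]\s*\d+(?:[.,]\d+)?°?)?$")


def ocr_words(image):
    """Run Tesseract on a PIL image and return its word-level result dict.

    pytesseract is imported lazily so the helpers below work without it.
    """
    import pytesseract
    from pytesseract import Output

    return pytesseract.image_to_data(
        image, config="--psm 11", output_type=Output.DICT
    )


def dimension_labels(data, min_conf=0.0):
    """Keep OCR words that look like dimensions, with their box centers."""
    labels = []
    for i, raw in enumerate(data["text"]):
        text = raw.strip()
        # Non-word entries come back with empty text and a confidence of -1.
        if not text or float(data["conf"][i]) < min_conf:
            continue
        if DIMENSION_RE.match(text):
            cx = data["left"][i] + data["width"][i] / 2
            cy = data["top"][i] + data["height"][i] / 2
            labels.append({"text": text, "center": (cx, cy)})
    return labels


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def associate(labels, measures):
    """Pair each label text with the index of the nearest measure line.

    measures is a non-empty list of ((x1, y1), (x2, y2)) pixel endpoints.
    """
    pairs = []
    for label in labels:
        nearest = min(
            range(len(measures)),
            key=lambda i: point_segment_distance(label["center"], *measures[i]),
        )
        pairs.append((label["text"], nearest))
    return pairs
```

You would call associate(dimension_labels(ocr_words(image)), measures) and then check the pairs by eye. Nearest-line matching is a deliberately naive choice: it will mislabel crowded areas and rotated text, which is why the verification step matters.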

Alternatively, you can run it through our API and get the results as JSON.

Here's a quick visualization of the result: Drawing Read (GT stands for General Tolerances)

You will find more information here: https://werk24.github.io/docs/. There will be a major release later this month (BTW: it's a commercial API). — jmattes, May 09 '20 at 19:19
