
I have a technical drawing in PDF format and I want to search it for very specific values, especially the diameter sign. I use ocrmypdf, which uses Tesseract OCR internally. Sometimes it recognizes the symbol correctly and sometimes it doesn't, and I can't explain why, because to my eye it looks very different from the other symbols. I uploaded a picture so you can see what I mean. Is there any way to tune the OCR tool to get better results?

(Picture of the diameter symbol)

I tried to whitelist the diameter symbol in my code, but if I understand the output correctly, only the digits 0 to 9 were whitelisted and the diameter sign was ignored. Is there something wrong with my code, or is it something else?

The value I get from the new searchable PDF is "218 -0,4". It seems the diameter sign has been changed to a 2, which I can't really explain.

import ocrmypdf

input_file = "C:/input.PDF"
output_file = "C:/output1.pdf"

ocrmypdf.ocr(input_file, output_file, deskew=True, force_ocr=True, tesseract_config='--psm 6 -c tessedit_char_whitelist="0123456789ø"')

Output
Scanning contents: 100%|██████████| 1/1 [00:00<00:00,  8.44page/s]
OCR:   0%|          | 0.0/1.0 [00:00<?, ?page/s][tesseract] read_params_file: Can't open --psm 6 -c tessedit_char_whitelist="0123456789"
OCR: 100%|██████████| 1.0/1.0 [00:11<00:00, 11.84s/page]
PDF/A conversion: 100%|██████████| 1/1 [00:01<00:00,  1.31s/page]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
The output file size is 7.83× larger than the input file.
Possible reasons for this include:
--force-ocr was issued, causing transcoding.
--deskew was issued, causing transcoding.
The optional dependency 'jbig2' was not found, so some image optimizations could not be attempted.
The optional dependency 'pngquant' was not found, so some image optimizations could not be attempted.
PDF/A conversion was enabled. (Try `--output-type pdf`.)
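
The `read_params_file: Can't open ...` line in the output suggests that Tesseract treated the whole `tesseract_config` string as the path of a config file rather than as command-line flags. A minimal sketch of the alternative, assuming that `tesseract_config` accepts a list of config file paths and that `tesseract_pagesegmode` is the keyword for the page segmentation mode. The helper names `write_tesseract_config` and `run_ocr` and the file name `diameter.cfg` are made up for illustration. A whitelist can only help if the language model in use knows the ø character at all, which I have not verified.

```python
from pathlib import Path


def write_tesseract_config(directory: Path, whitelist: str) -> Path:
    """Write a Tesseract config file with a single 'name value' line."""
    config_path = directory / "diameter.cfg"  # hypothetical file name
    config_path.write_text(
        f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8"
    )
    return config_path


def run_ocr(input_file: str, output_file: str, config_path: Path) -> None:
    """Run ocrmypdf with the config file instead of inline flags."""
    # Imported here so the helper above can be used without ocrmypdf installed.
    import ocrmypdf

    ocrmypdf.ocr(
        input_file,
        output_file,
        deskew=True,
        force_ocr=True,
        tesseract_pagesegmode=6,  # assumed equivalent of '--psm 6'
        tesseract_config=[str(config_path)],  # assumed: list of config file paths
    )


# Example call (paths taken from the question):
# cfg = write_tesseract_config(Path("C:/"), "0123456789ø")
# run_ocr("C:/input.PDF", "C:/output1.pdf", cfg)
```

Writing the file as UTF-8 keeps the ø intact, which may matter because the log line above shows the whitelist without it.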
Nick Stankat
  • If you squint a little, then that diameter symbol is round on top, has a diagonal and something at the bottom, just like a 2 – Hans Kesting Apr 23 '23 at 14:31
  • @KJ I think if you use it as a picture then it will work as a diameter for sure. But in a technical drawing there are so many other numbers. Can you suggest another OCR tool for numeric tasks? – Nick Stankat Apr 23 '23 at 15:32

0 Answers