Questions tagged [hocr]

hOCR is an open standard that defines a data format for representing OCR output. The standard aims to embed layout, recognition confidence, style, and other information into the recognized text itself. It does so by embedding this data in standard HTML.

Public Specification for the hOCR Format
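As a rough illustration of the embedding described in the tag description above, the sketch below reads word text, bounding boxes, and confidences back out of a small hOCR fragment using only the Python standard library. The sample markup is hand-written in the style Tesseract emits (`ocrx_word` spans whose `title` attribute carries `bbox` and `x_wconf`), and the names `parse_title`, `WordCollector`, and `extract_words` are hypothetical helpers, not part of any hOCR tool.

```python
from html.parser import HTMLParser

# Hand-written sample in the style of Tesseract's hOCR output (an assumption,
# not taken from a real run). Properties live in the title attribute,
# separated by semicolons.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 220 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 220 116; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordCollector(HTMLParser):
    """Collect text, bbox, and confidence for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Simplification: assumes word spans contain no nested <span> elements.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    collector = WordCollector()
    collector.feed(hocr)
    return collector.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word["text"], word["bbox"], word["conf"])
```

The `bbox` values are pixel coordinates in the order x0 y0 x1 y1. Real hOCR files may nest further elements inside a word or use other properties, so this sketch would need extending for such input.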

31 questions
38
votes
6 answers

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants…
James Owers
  • 7,948
  • 10
  • 55
  • 71
14
votes
3 answers

HOCR to HTML for visualizing

How to convert hOCR to HTML for visualization? If you open the raw hOCR file, it's only rendered as plain text (the elements are not positioned).
clarkk
  • 27,151
  • 72
  • 200
  • 340
12
votes
2 answers

Convert hOCR to HTML table

I am looking for a tool, or an idea to implement in Python, that converts an hOCR file (generated by Tesseract in my application) to an HTML table. The idea is to utilize the text location information in the hOCR file (provided in the bbox attribute) to create…
azri.dev
  • 311
  • 3
  • 8
7
votes
3 answers

Not able to understand coordinates in extracted document using OCR engine Tesseract

I have extracted text from an image document with Tesseract, and the extraction was successful. But I am not able to understand the coordinates of the extracted document. Problem description: it shows coordinates, but please let me know whether these coordinates…
S.P Singh
  • 1,267
  • 3
  • 17
  • 23
7
votes
2 answers

Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

In the Tesseract FAQ they say you can: How can I get the coordinates and confidence of each character? There are two options. If you would rather not get into programming, you can use Tesseract's hocr output format (read the Tesseract manual page…
sashoalm
  • 75,001
  • 122
  • 434
  • 781
5
votes
4 answers

How to get Hocr output using python-tesseract

I had been getting really good results using pytesseract, but it is not able to preserve double spaces, and they are really important for me. So I decided to retrieve hOCR output rather than plain text. But there doesn't appear to be any way of…
Anurag
  • 59
  • 1
  • 1
  • 6
4
votes
1 answer

Extract data from tesseract hocr xhtml file

I'm trying to use Python to extract data from Tesseract's hOCR output file. We're limited to Tesseract version 3.04, so no image_to_data function or TSV output is available. I have been able to do it with BeautifulSoup and in R, but that's neither…
GeorgeR90
  • 137
  • 1
  • 9
3
votes
1 answer

Detecting bold (and italic) text in an image

I want to detect stretches of bold (and perhaps italic) text in images of pages--think TIFFs, or image PDFs. I need pointers to any open source software that does that. Here's a picture of a dictionary entry (from a Tzeltal--Spanish dictionary)…
Mike Maxwell
  • 547
  • 4
  • 11
3
votes
0 answers

Meaning of x_descenders and x_ascenders in hOCR file?

Here's the line from Tesseract 4 output (.hocr file): What's the meaning of x_descenders and x_ascenders…
dzieciou
  • 4,049
  • 8
  • 41
  • 85
3
votes
0 answers

How to enable hocr font info in tesseract 4?

I'm using Tesseract 4 on Ubuntu 16.04. When using the hOCR feature in Tesseract, even after activating font info in the hocr config file (hocr_font_info 1), I'm still not getting "x_font" info. Is there any other way to enable font info in Tesseract 4?
hamma
  • 129
  • 2
  • 14
2
votes
0 answers

Generate hOCR from Microsoft Computer Vision OCR

I'm using the Microsoft Read API to derive OCR data from local images. My script is based on this tutorial: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts/python-hand-text?tabs=version-2. While I can obtain…
paratext
  • 53
  • 6
2
votes
1 answer

getting hocr output from tika-server

I am running OCR on a PDF file using Apache Tika Server. I am interested in the hOCR output, but have only succeeded in getting the output in plain text format. Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers.…
Amnon
  • 2,212
  • 1
  • 19
  • 35
2
votes
1 answer

What are the strategies to convert an HOCR output to a string (for regex purposes)?

I am working with Pytesseract and would like to convert an hOCR output to a string. Of course, such a function is implemented in Pytesseract, but I would like to know more about the possible strategies to get it done. Thanks. from pytesseract import…
2
votes
0 answers

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed. I am trying to OCR a document with an irregular structure, a letter looking like this: The problem is that in the .hocr file generated by Tesseract I get lines consisting of left and right…
Dedalus
  • 340
  • 2
  • 10
2
votes
1 answer

hOCR Files with Tesseract / Determining if a PDF has high quality text layers

I have a Tesseract 4.0 setup we are using with an LSTM model for OCR; incoming scanned PDFs are deconstructed into individual 300dpi upsampled PNGs, then deskewed and OCR'ed, then re-assembled into a PDF with text layers while also saving each page…
Greg Perry
  • 151
  • 10