Questions tagged [hocr]

hOCR is an open standard that defines a data format for representing OCR output. The standard aims to embed layout, recognition confidence, style, and other information into the recognized text itself. It does so by encoding that data in standard HTML.
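As a rough illustration of what "embedding into HTML" means in practice, the sketch below reads word text, bounding boxes, and confidences out of a small hOCR fragment using only the Python standard library. The `SAMPLE` markup is a hand-written fragment in the style Tesseract emits (an `ocrx_word` span whose `title` carries `bbox` and `x_wconf`), and `HocrWordParser` and `parse_hocr_words` are hypothetical names chosen for this example, not part of any library.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bbox, and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            # The title attribute holds "key value..." properties separated by ";".
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "text": "",
                "bbox": tuple(map(int, bbox.groups())) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


# Hand-written sample fragment (an assumption, not real OCR output).
SAMPLE = """<span class='ocr_line' title='bbox 36 92 580 116'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 109 92 185 116; x_wconf 91'>world</span>
</span>"""


def parse_hocr_words(hocr):
    """Return a list of dicts, one per recognized word."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in parse_hocr_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

The bounding box is given as `x0 y0 x1 y1` in pixel coordinates of the page image, which is what makes positioned rendering or table reconstruction from hOCR possible.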

Public Specification for the hOCR Format

31 questions

6 answers

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants…

ocr tesseract hocr

asked Feb 18 '15 at 18:27

James Owers

3 answers

HOCR to HTML for visualizing

How to convert hOCR to HTML for visualization? If you open the raw hOCR file, it's only rendered as plain text (the elements are not positioned).

html ocr hocr

asked Jul 13 '16 at 20:35

clarkk

2 answers

Convert hOCR to HTML table

I am looking for a tool, or an idea that could be implemented in Python, that converts an hOCR file (generated by Tesseract in my application) to an HTML table. The idea is to use the text location information in the hOCR file (provided in the bbox attribute) to create…

python html html-table tesseract hocr

asked Jun 24 '15 at 14:45

azri.dev

3 answers

Not able to understand coordinates in a document extracted using the Tesseract OCR engine

I extracted an image document with Tesseract, and the extraction succeeded. But I am not able to understand the coordinates in the extracted document. Problem description: it shows coordinates, but please tell me whether these coordinates…

ocr tesseract text-extraction hocr

asked Aug 31 '13 at 16:38

S.P Singh

2 answers

Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

In the Tesseract FAQ they say you can: How can I get the coordinates and confidence of each character? There are two options. If you would rather not get into programming, you can use Tesseract's hocr output format (read the Tesseract manual page…

ocr tesseract hocr

asked Apr 05 '13 at 08:24

sashoalm

4 answers

How to get hOCR output using python-tesseract

I had been getting really good results using pytesseract, but it is not able to preserve double spaces, and they are really important for me. So I decided to retrieve hOCR output rather than plain text. But there doesn't appear to be any way of…

tesseract python-tesseract hocr

asked Dec 13 '15 at 06:10

Anurag

1 answer

Extract data from tesseract hocr xhtml file

I'm trying to use Python to extract data from Tesseract's hOCR output file. We're limited to Tesseract version 3.04, so no image_to_data function or TSV output is available. I have been able to do it with BeautifulSoup and in R, but that's neither…

python xhtml tesseract hocr

asked Jun 05 '18 at 14:10

GeorgeR90

1 answer

Detecting bold (and italic) text in an image

I want to detect stretches of bold (and perhaps italic) text in images of pages--think TIFFs, or image PDFs. I need pointers to any open source software that does that. Here's a picture of a dictionary entry (from a Tzeltal--Spanish dictionary)…

ocr hocr

asked May 17 '21 at 22:33

Mike Maxwell

0 answers

Meaning of x_descenders and x_ascenders in hOCR file?

Here's the line from Tesseract 4 output (.hocr file): What's the meaning of x_descenders and x_ascenders…

tesseract hocr

asked Dec 10 '19 at 17:42

dzieciou

0 answers

How to enable hOCR font info in Tesseract 4?

I'm using Tesseract 4 on Ubuntu 16.04. When using the hOCR feature in Tesseract, even after activating font info in the hocr config file (hocr_font_info 1), I'm still not getting "x_font" info. Is there any other way to enable font info in Tesseract 4?

linux tesseract hocr

asked Jun 15 '17 at 15:38

hamma

0 answers

Generate hOCR from Microsoft Computer Vision OCR

I'm using the Microsoft Read API to derive OCR data from local images. My script is based on this tutorial: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts/python-hand-text?tabs=version-2. While I can obtain…

azure computer-vision ocr hocr

asked May 28 '20 at 21:17

paratext

1 answer

Getting hOCR output from tika-server

I am running OCR on a PDF file using Apache Tika Server. I am interested in the hOCR output, but have only managed to get the output in plain text format. Following the wiki and the code, I am trying to configure Tesseract using X-Tika-OCR... HTTP headers.…

tesseract apache-tika tika-server hocr

asked Jan 09 '20 at 10:40

Amnon

1 answer

What are the strategies to convert an HOCR output to a string (for regex purposes)?

I am working with pytesseract and would like to convert an hOCR output to a string. Of course, such a function is implemented in pytesseract, but I would like to know more about the possible strategies to get it done. Thanks. from pytesseract import…

python python-tesseract hocr

asked Aug 09 '19 at 15:40

Maxime Georges

0 answers

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed. I am trying to OCR a document with an irregular structure, a letter looking like this: The problem is that in the .hocr file generated by Tesseract, I get lines consisting of left and right…

ocr python-tesseract text-recognition hocr

asked Sep 29 '18 at 19:26

Dedalus

1 answer

hOCR Files with Tesseract / Determining if a PDF has high quality text layers

I have a Tesseract 4.0 setup that we are using with an LSTM model for OCR; incoming scanned PDFs are deconstructed into individual 300 dpi upsampled PNGs, then deskewed and OCR'ed, then reassembled into a PDF with text layers while also saving each page…

tesseract hocr

asked Feb 14 '18 at 02:41

Greg Perry
