Questions tagged [document-layout-analysis]

23 questions
8
votes
2 answers

text layout recognition with python

I'm trying to go through several thousand scanned files and sort them into folders based on type (i.e., if one of the files is a scanned copy of formA, then it should go in the formA folder; if it's a scanned copy of formB, then it should go in the…
danwoods
7
votes
2 answers

Extracting data from tables without any grid lines and border from scanned image of document

Extracting table data from digital PDFs has been simple using camelot and tabula. However, that approach doesn't work with scanned images of document pages, specifically when the table doesn't have borders or inner grid lines. I have been trying to…
6
votes
1 answer

how to detect orientation of a scanned document?

I'd like to detect and, if necessary, correct the orientation of a scanned document image. I am already able to deskew documents; however, a document may still be upside down and need to be rotated by 180°. Using tesseract's layout…
Pedro
6
votes
5 answers

Document Layout Analysis for text extraction

I need to analyze the layout structure of different document types, such as PDF, DOC, DOCX, ODT, etc. My task is: given a document, group the text into blocks and find the correct boundaries of each. I did some tests using Apache Tika, which is a good…
5
votes
2 answers

How to extract data from invoices in tabular format

I'm trying to extract data from PDF/image invoices using computer vision. For that I used the OCR-based pytesseract. This is a sample invoice; you can find the code for it below: import pytesseract img = Image.open("invoice-sample.jpg") text =…
5
votes
2 answers

Determining which are the text and graphic regions in an image

I don't know whether I should post this question here or not, but if someone knows the answer, please reply. What are the algorithms for determining which region in an image is text and which one is graphic? That is, how do I separate such regions? (figure or…
3
votes
2 answers

Tesseract: How to export text and boundingboxes?

I'd like to convert document images to XML and also export the location where a certain word has been found within a page. In order to access bounding box information, tesseract's layout analysis can be used: tess.SetImage(...); …
Pedro
2
votes
2 answers

How do I separate the paragraphs of text in a scan of a two-column layout text document?

I have the above image and would like to slice it into individual questions. I would like to do it programmatically using Python and image libraries.
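One common starting point for this kind of two-column split is a vertical projection profile: count the ink pixels in each pixel column and treat wide runs of empty columns as gutters. The sketch below is only an illustration under stated assumptions, not an answer taken from the thread. It assumes the scan has already been binarized into a list of rows of 0 (background) and 1 (ink), and the names `column_gutters`, `split_columns`, and `min_width` are hypothetical.

```python
def column_gutters(binary, min_width=2):
    """Return (start, end) x-ranges that contain no ink at all.

    `binary` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    Runs of empty columns narrower than `min_width` are ignored.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection profile: ink count per pixel column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    gutters = []
    start = None
    for x, ink in enumerate(profile):
        if ink == 0 and start is None:
            start = x  # a blank run begins
        elif ink != 0 and start is not None:
            if x - start >= min_width:
                gutters.append((start, x))
            start = None  # the blank run ends
    # Close a blank run that reaches the right edge of the page.
    if start is not None and width - start >= min_width:
        gutters.append((start, width))
    return gutters


def split_columns(binary, min_width=2):
    """Return (start, end) x-ranges of the text columns between gutters."""
    width = len(binary[0]) if binary else 0
    columns = []
    cursor = 0
    for g_start, g_end in column_gutters(binary, min_width):
        if g_start > cursor:
            columns.append((cursor, g_start))
        cursor = g_end
    if cursor < width:
        columns.append((cursor, width))
    return columns


# Tiny synthetic "page": ink in x = 1..3 and x = 6..8, blank elsewhere.
page = [[0, 1, 1, 1, 0, 0, 1, 1, 1, 0] for _ in range(3)]
print(split_columns(page))
```

Applying the same idea to row sums inside each column (a horizontal profile) would give candidate gaps between paragraphs or questions. Real scans usually need deskewing and noise removal first, since a single stray pixel breaks an otherwise empty gutter.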
1
vote
0 answers

DatasetGenerationError: An error occurred while generating the dataset

I'm trying to load my PubLayNet dataset from an S3 bucket into Databricks using Hugging Face datasets like this: dataset_id = "/dbfs/mnt/ocr/dataset/publaynet" dataset = load_dataset(dataset_id, data_files={"train":…
1
vote
2 answers

Determine the angle of text on an image

I would like to determine the angle of inclination of the text in my PDF document (in order to align this document as a result). I receive a PDF document scanned by people, and accordingly, this document will not be perfectly aligned. There are…
Paul
1
vote
5 answers

How to detect figures in a paper news image in Python?

So I have this project in Python (computer vision), which is separating text from figures in an image (like a newspaper image). My question is: what's the best way to detect those figures in the paper (in Python)? Paper image example: Paper…
1
vote
0 answers

How do I get the OCR PDF layout with the AWS Textract API?

We plan to use the AWS Textract service for document analysis. Presently the result comes in bounding-box format. Does anyone know how to get the exact PDF layout with this service? OCR PDF document text extraction for document analysis jobId =…
0
votes
1 answer

Divide an image into tiles based on text structure in Python OpenCV

I'm a beginner to computer vision and OpenCV, but I do have moderate experience with Python. I am trying to write a program that takes an image and divides the image into tiles based on the structural organization of the text. For example, given a…
0
votes
1 answer

Form Recognizer in deep Learning with Annotation

I have regular digital forms with blanks, boxes, checkboxes, tables, and signature fields. My aim is to extract the field name along with its fillable coordinates. For example, if the form has a field named "Name of benificiary" and has its corresponding…
0
votes
0 answers

How to draw a red rectangle around a string of words using pytesseract that has an incorrect spelling

This is my image: test2.png I can recognize the words: recognize text I need to check if there is an incorrect word/s (incorrect spelling) in the text image, highlight this word/s with a red color rectangle and display an "x" above indicating that…