Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.
Questions tagged [pdf-extraction]
148 questions
36 votes, 12 answers
How to check if PDF is scanned image or contains text
I have a large number of files; some of them are scanned images saved as PDF and some are full/partial text PDFs.
Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…

Jinu Joseph
34 votes, 2 answers
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably…

dave walker
22 votes, 10 answers
How to extract text from pdf in Python 3.7
I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so I can easily record monthly spending. Right now I am focusing just…

RaV1oLLi
14 votes, 3 answers
How to extract text under specific headings from a pdf?
I want to extract text under specific headings from a PDF using Python.
For example, I have a PDF with the headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?
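A minimal sketch of one possible approach, assuming the PDF text has already been extracted to a plain string by some library and that each heading sits alone on its own line. The function name `text_under_heading` and the sample text are hypothetical, chosen only for illustration:

```python
def text_under_heading(text, heading, headings):
    """Return the lines after `heading` up to the next known heading.

    Assumes every heading occupies a line of its own in the extracted text.
    """
    wanted = heading.strip().lower()
    known = {h.strip().lower() for h in headings}
    collected = []
    inside = False
    for line in text.splitlines():
        key = line.strip().lower()
        if key == wanted and not inside:
            inside = True          # start collecting after the wanted heading
            continue
        if inside and key in known:
            break                  # the next heading ends the section
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()


# Hypothetical extracted text, for illustration only.
sample = """Introduction
This report covers the first quarter.
Summary
Revenue grew.
Costs were flat.
Contents
1. Details"""

summary = text_under_heading(sample, "Summary",
                             ["Introduction", "Summary", "Contents"])
print(summary)
```

Real PDFs often break this assumption (headings may be split across lines or merged with body text), so matching on font size or position is usually more robust where the extraction library exposes it.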

AlfiyaFaisy
12 votes, 5 answers
How to export pdf form fields to xml automatically
I have a PDF file including form fields and need to export the data into an XML file AUTOMATICALLY. Here is a screenshot of a sample form I created for testing:
Note: It works great exporting it MANUALLY using Acrobat Professional by clicking on Tools >…

Michael
11 votes, 3 answers
How to improve Hindi text extraction?
I am trying to extract Hindi text from a PDF. I tried all the methods to extract it from the PDF, but none of them worked. There are explanations of why it doesn't work, but no answers as such. So, I decided to convert the PDF to an image, and then use…

Abhishek Rai
8 votes, 0 answers
get X,Y co-ordinates of the selected area from PDF
I'm trying to extract text from a particular section of a PDF. If I know the X,Y co-ordinates of the area, I'm able to extract the text. But I'm unable to get the co-ordinates of the selected area from the PDF. Kindly help me if anyone has tried this…

Sasikumar
7 votes, 1 answer
How to extract the contents of a table in pdf file?
I want to extract the contents of a table in a PDF like this:
I wrote this Java program using the iText Java PDF library, which can read the contents of a PDF file line by line, but I do not know how to get the contents of a table
import…

Bertrand
6 votes, 1 answer
How to extract images and image BBox coordinates using python?
I am trying to extract images from a PDF together with the BBox coordinates of each image.
I tried using the pdfrw library; it identifies image objects, and it has an attribute called media box which has some coordinates, but I am not sure if those are the correct bbox…

Satyaaditya
6 votes, 2 answers
Find PDF Dimensions with Camelot
I am using Camelot to read complete PDFs and extract about 112 attributes from each one.
I use table areas to extract the attributes:
test_variable = camelot.read_pdf(filename, flavor='stream',
                                 table_areas=['38, 340 ,50, 328'])…

A.A. F
6 votes, 1 answer
PyPDF2 to extract vertical text from scanned pdf
I am trying to extract text from a scanned PDF using PyPDF2. Some of the PDFs contain text aligned vertically, but the orientation of the page is portrait. Is there any way to identify if the text is vertically aligned and read vertical lines in…

Mms
4 votes, 3 answers
Problems to extract table data using camelot without error message
I am trying to extract tables from this PDF link using camelot; however, when I try the following code:
import camelot
file = 'relacao_medicamentos_rename_2020.pdf'
tables =…

Gabriel Souto
4 votes, 2 answers
Pdfplumber cannot recognise table python
I use Pdfplumber to extract the table on page 2, section 3 (normally). But it only works on some PDFs; on others it does not. For the failed PDF files, it seems like Pdfplumber reads the button table instead of the table I want.
How can I get the table?
link…

Joan Mok
4 votes, 2 answers
How to get only the first match of a RegEx (UiPath Studio RegEx Based Extractor)
I have the following text that I extracted from a PDF using UiPath Studio's OCR. It's the same block of text repeated 3 times due to it being the original, duplicate & triplicate of the same PDF page.
Os bens/serviços foram colocados à disposição do…
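For comparison, a minimal sketch of the "first match only" idea using Python's `re` module. This is not UiPath's RegEx Based Extractor, whose configuration may differ; the helper name `first_match`, the pattern, and the sample text are hypothetical:

```python
import re


def first_match(pattern, text):
    """Return the first substring matching `pattern`, or None if none exists."""
    match = re.search(pattern, text)  # re.search stops at the first match
    return match.group(0) if match else None


# Hypothetical OCR output: the same block repeated three times.
repeated = "Total: 10,50 EUR\n" * 3

print(first_match(r"Total: [\d,]+ EUR", repeated))
```

`re.search` returns after the first hit, whereas `re.findall` would return all three repetitions.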

lcvalves
4 votes, 4 answers
How to retrieve ALL pages from PDF as a single string in Python 3 using PyPDF2
In order to get a single string from a multi-paged PDF I'm doing this:
import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page =…
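A minimal sketch of one way such a loop could collect every page into a single string. It is an assumption, not the asker's code: it relies on the legacy PyPDF2 reader interface (`numPages`, `getPage`, `extractText`), which was removed in PyPDF2 3.0, and the helper name `all_pages_text` is hypothetical:

```python
def all_pages_text(reader, separator="\n"):
    """Join the extracted text of every page of `reader` into one string.

    `reader` is expected to expose the legacy PyPDF2 interface:
    `numPages`, `getPage(i)`, and `page.extractText()`.
    """
    parts = []
    for i in range(reader.numPages):
        page = reader.getPage(i)
        parts.append(page.extractText())
    return separator.join(parts)


# Usage with the legacy PyPDF2 API (requires PyPDF2 < 3.0):
# import PyPDF2
# with open('sample.pdf', 'rb') as pdf_file:
#     text = all_pages_text(PyPDF2.PdfFileReader(pdf_file))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation inside the loop.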

Gavrk