Highest Voted 'pdf-extraction' Questions

36

votes

12 answers

How to check if PDF is scanned image or contains text

I have a large number of files; some are scanned images saved as PDF and some are full/partial text PDFs. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…

asked Apr 16 '19 at 08:54

Jinu Joseph

542
1
4
17

34

votes

2 answers

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably…

pdf itext pdf-extraction

asked Mar 27 '14 at 00:08

dave walker

3,058
1
24
30

22

votes

10 answers

How to extract text from pdf in Python 3.7

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so that monthly spending is easy to record. Right now I am focusing just…

python pdf python-3.7 pypdf pdf-extraction

asked Apr 19 '19 at 20:29

RaV1oLLi

529
1
3
9

14

votes

3 answers

How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Summary'. How can I do this?

python-2.7 pdf document text-extraction pdf-extraction

asked Jan 05 '18 at 05:19

AlfiyaFaisy

314
1
3
15
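One way to approach the heading question above, once the PDF text has already been extracted to a plain string by whatever library is in use, is to scan it line by line. The sketch below is only an illustration under that assumption: the function name `text_under_heading` and the sample text are hypothetical, and it assumes each heading sits alone on its own line, which real PDF extractions do not always guarantee.

```python
def text_under_heading(text, heading, all_headings):
    """Return the lines between `heading` and the next known heading.

    Assumes `text` is already-extracted plain text and that every
    heading appears alone on its own line (an assumption, not a
    property of PDFs in general).
    """
    collected = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == heading:
            # Start collecting after the wanted heading line.
            inside = True
            continue
        if inside and stripped in all_headings:
            # Any other known heading ends the section.
            break
        if inside:
            collected.append(line)
    return "\n".join(collected).strip() if inside else None


# Hypothetical sample standing in for extracted PDF text.
sample = (
    "Introduction\n"
    "This report covers the first quarter.\n"
    "Summary\n"
    "Revenue rose.\n"
    "Costs fell.\n"
    "Contents\n"
    "1. Overview\n"
)
print(text_under_heading(sample, "Summary", ["Introduction", "Summary", "Contents"]))
```

Matching whole stripped lines, rather than searching for the word anywhere, keeps a sentence that merely mentions "Summary" from being mistaken for the heading.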

12

votes

5 answers

How to export pdf form fields to xml automatically

I have a PDF file that includes form fields and need to export the data into an XML file AUTOMATICALLY. Here is a screen of a sample form I created for testing: Note: It works great exporting it MANUALLY using Acrobat Professional by clicking on Tools >…

java xml python-2.7 acrobat pdf-extraction

asked Jan 09 '14 at 00:40

Michael

3,982
4
30
46

11

votes

3 answers

How to improve Hindi text extraction?

I am trying to extract Hindi text from a PDF. I tried all the methods to extract it from the PDF, but none of them worked. There are explanations of why it doesn't work, but no answers as such. So, I decided to convert the PDF to an image, and then use…

python python-tesseract pdf-extraction

asked Jun 03 '21 at 06:06

Abhishek Rai

2,159
3
18
38

8

votes

0 answers

get X,Y co-ordinates of the selected area from PDF

I'm trying to extract text from a particular section of a PDF. If I know the X,Y co-ordinates of the area, I'm able to extract the text. But I'm unable to get the co-ordinates of the selected area from the PDF. Kindly help me if anyone has tried this…

pdf pdf.js pdf-extraction

asked Jun 25 '14 at 04:14

Sasikumar

675
2
7
17

7

votes

1 answer

How to extract the contents of a table in pdf file?

I want to extract the contents of a table in a PDF like this: I wrote this Java program using the iText Java PDF library, which can read the contents of a PDF file line by line, but I do not know how to get the contents of the table import…

java pdf itext text-extraction pdf-extraction

asked Jul 09 '15 at 22:00

Bertrand

341
1
2
12

6

votes

1 answer

How to extract images and image BBox coordinates using python?

I am trying to extract images from a PDF along with the BBox coordinates of each image. I tried the pdfrw library; it identifies image objects, and they have an attribute called media box that holds some coordinates, but I am not sure if those are the correct bbox…

python pypdf pdf-extraction pdfrw

asked Feb 06 '19 at 06:41

Satyaaditya

537
8
26

6

votes

2 answers

Find PDF Dimensions with Camelot

I am using Camelot to read complete PDFs and extract about 112 attributes from each one. I use table areas to extract the attributes test_variable = camelot.read_pdf(filename, flavor='stream', table_areas=['38, 340 ,50, 328'])…

python pdf-extraction python-camelot

asked Jan 14 '19 at 06:32

A.A. F

349
5
16

6

votes

1 answer

PyPDF2 to extract vertical text from scanned pdf

I am trying to extract text from a scanned PDF using PyPDF2. Some of the PDFs contain vertically aligned text, but the page orientation is portrait. Is there any way to identify if the text is vertically aligned and read vertical lines in…

python python-3.x pypdf pdfminer pdf-extraction

asked Sep 27 '18 at 05:53

Mms

91
4

4

votes

3 answers

Problems to extract table data using camelot without error message

I am trying to extract tables from this pdf link using camelot; however, when I try the following code: import camelot file = 'relacao_medicamentos_rename_2020.pdf' tables =…

python ghostscript python-camelot pdf-extraction

asked Dec 30 '21 at 15:12

Gabriel Souto

600
7
19

4

votes

2 answers

Pdfplumber cannot recognise table python

I use Pdfplumber to extract the table on page 2, section 3 (normally). But it only works on some PDFs; on others it does not. For the failed PDF files, it seems like Pdfplumber reads the bottom table instead of the table I want. How can I get the table? link…

python tabular pdf-extraction

asked Jul 20 '20 at 17:01

Joan Mok

91
3
9

4

votes

2 answers

How to get only the first match of a RegEx (UiPath Studio RegEx Based Extractor)

I have the following text that I extracted from a PDF using UiPath Studio's OCR. It's the same block of text repeated 3 times due to it being the original, duplicate & triplicate of the same PDF page. Os bens/serviços foram colocados à disposição do…

regex ocr uipath uipath-studio pdf-extraction

asked Jul 20 '20 at 13:57

lcvalves

77
1
9
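The first-match idea in the question above can be illustrated outside UiPath. The sketch below uses Python's `re.search`, which stops at the first occurrence, in contrast to `re.findall`, which returns every occurrence. It is not UiPath configuration; the helper name `first_match`, the pattern, and the sample OCR text are all hypothetical stand-ins for the repeated original/duplicate/triplicate block.

```python
import re


def first_match(pattern, text, flags=0):
    """Return the first substring matching `pattern`, or None if there is none."""
    match = re.search(pattern, text, flags)  # scans left to right, stops at the first hit
    return match.group(0) if match else None


# Hypothetical OCR output: the same block repeated three times.
ocr_text = (
    "ORIGINAL\nTotal: 10,00\n"
    "DUPLICADO\nTotal: 10,00\n"
    "TRIPLICADO\nTotal: 10,00\n"
)
pattern = r"Total: \d+,\d{2}"

print(len(re.findall(pattern, ocr_text)))  # number of occurrences across all copies
print(first_match(pattern, ocr_text))      # only the first occurrence
```

In a regex engine that only exposes "all matches", taking the first element of the result list gives the same outcome.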

4

votes

4 answers

How to retrieve ALL pages from PDF as a single string in Python 3 using PyPDF2

In order to get a single string from a multi-paged PDF I'm doing this: import PyPDF2 pdfFileObject = open('sample.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObject) count = pdfReader.numPages for i in range(count): page =…

python python-3.x pdf pypdf pdf-extraction

asked Feb 13 '20 at 01:03

Gavrk

295
1
4
16
