Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
Questions tagged [pdfplumber]
95 questions
5
votes
1 answer
How to extract text from a two-column PDF using PDFPlumber
I am working on topic modeling tasks in Python and would like to extract text from annual/sustainability reports. My problem is that when I extract a report, the extracted lines are broken across the two columns of a page…
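One common way to approach the two-column problem described above is to crop each page into a left and a right half and extract the halves in reading order. The following is a minimal sketch under stated assumptions, not an accepted answer from the thread: it assumes the columns split at the horizontal middle of the page and that the page's bounding box starts at (0, 0). The helper names `column_bboxes` and `extract_two_column_text`, the `gutter` parameter, and the file name in the usage comment are all hypothetical; `page.width`, `page.height`, `page.crop()` and `extract_text()` are pdfplumber's own.

```python
def column_bboxes(width, height, gutter=0.0):
    """Return (left, right) bounding boxes as (x0, top, x1, bottom).

    Assumption: the two columns are split at the horizontal middle of the
    page, with an optional empty gutter of the given width between them.
    """
    mid = width / 2
    left = (0, 0, mid - gutter / 2, height)
    right = (mid + gutter / 2, 0, width, height)
    return left, right


def extract_two_column_text(page, gutter=0.0):
    """Extract the left column top to bottom, then the right column.

    `page` is expected to behave like a pdfplumber page (width, height,
    crop(bbox), extract_text()).
    """
    parts = []
    for bbox in column_bboxes(page.width, page.height, gutter):
        text = page.crop(bbox).extract_text()
        if text:  # skip empty columns
            parts.append(text)
    return "\n".join(parts)


# Usage sketch (hypothetical file name):
#
#   import pdfplumber
#   with pdfplumber.open("report.pdf") as pdf:
#       for page in pdf.pages:
#           print(extract_two_column_text(page))
```

Pages with full-width headers, or with columns of unequal width, would need different bounding boxes than this even split.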
2
votes
1 answer
Trouble parsing interview transcript (Q&As) where questioner name is sometimes redacted
I have the following Python script I wrote as a reproducible example of my current PDF-parsing hangup. It:
downloads a PDF transcript from the web (Cassidy Hutchinson's 9/14/2022 interview transcript with the J6C)
reads/OCRs that PDF to…

Max Power
2
votes
1 answer
How to solve (cid:x) pdfplumber python text extraction
PDF_Doc
I've been working with the pdfplumber library to extract text from PDF documents and it's been fine. However, in the documents I'm working on now, I just get spaces and lots of (cid:x) instead of text. Any solution?
Thanks
with…

foliveir
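As far as I understand, `(cid:N)` placeholders generally show up when the extractor cannot map a font's character codes to Unicode, so post-processing cannot recover the missing text. What a script can do is measure how badly a page is affected and decide whether to fall back to another approach such as OCR. The sketch below illustrates only that detection step; the names `CID_TOKEN`, `cid_ratio` and `strip_cid_tokens` are hypothetical, and any threshold you pick is an assumption to tune.

```python
import re

# Matches placeholders such as "(cid:12)" in extracted text.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Return the fraction of characters that belong to (cid:N) placeholders."""
    if not text:
        return 0.0
    cid_chars = sum(len(m.group(0)) for m in CID_TOKEN.finditer(text))
    return cid_chars / len(text)


def strip_cid_tokens(text):
    """Remove (cid:N) placeholders; this cleans output but recovers no text."""
    return CID_TOKEN.sub("", text)


# Usage sketch: treat a page as unreadable when most of it is placeholders.
#
#   text = page.extract_text() or ""
#   if cid_ratio(text) > 0.5:   # threshold is an arbitrary assumption
#       ...fall back to OCR or another extractor...
```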
2
votes
3 answers
Issue with ligatures when converting PDF to text
I am running into an issue when trying to convert a PDF to text where the ligatures 'fi', 'ff', and 'fl' are being converted to an empty space. I have read through quite a few similar threads on the issue but have not found a solution that works.
This…

Garrett
2
votes
1 answer
extract borderless table with pdfplumber
I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameter, but pdfplumber does not recognize the borderless tables correctly.
The PDF file can be downloaded from the link.
Here is …

go sgenq
2
votes
1 answer
How to extract texts and tables pdfplumber
With the pdfplumber library, you can extract the text of a PDF page, or you can extract the tables from a pdf page.
The issue is that I can't seem to find a way to extract text and tables. Essentially, if the pdf is formatted in this…

Justin Furuness
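One way to get both text and tables from the same page is to locate the tables first, then extract only the text that falls outside their bounding boxes. The sketch below is an illustration under assumptions rather than the thread's answer: `inside_bbox` and `text_outside_tables` are hypothetical helper names, and an object counts as "inside" a table when its centre point lies in the table's box. `page.find_tables()`, the tables' `.bbox` attribute, `page.filter()` and `extract_text()` are pdfplumber's own.

```python
def inside_bbox(obj, bbox):
    """Return True if the centre of a page object lies inside bbox.

    obj is a dict with "x0", "x1", "top", "bottom" keys, as pdfplumber
    page objects have; bbox is (x0, top, x1, bottom).
    """
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid <= x1 and top <= v_mid <= bottom


def text_outside_tables(page):
    """Extract the page text while skipping everything inside detected tables."""
    table_bboxes = [table.bbox for table in page.find_tables()]

    def keep(obj):
        return not any(inside_bbox(obj, bbox) for bbox in table_bboxes)

    return page.filter(keep).extract_text()


# Usage sketch:
#
#   body_text = text_outside_tables(page)
#   tables = page.extract_tables()
```

The result depends entirely on how well table detection works for the document, so borderless tables may need custom table settings first.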
2
votes
1 answer
Use pdfplumber to extract paragraphs
I'm using pdfplumber to extract text from a pdf. I'm able to extract lines of text, but I'm having trouble extracting a paragraph. Here's the current code I have.
Example of text I want to extract:
Paragraph Title
Lorem ipsum dolor sit amet,…

Solana Liu
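When lines can already be extracted, one possible way to rebuild paragraphs is to start a new paragraph whenever the vertical gap between consecutive lines exceeds a threshold. This is a minimal sketch, assuming each line is a dict with "text", "top" and "bottom" keys (the shape I would expect from `page.extract_text_lines()` in recent pdfplumber versions). The function name `group_lines_into_paragraphs` and the `max_gap` default are hypothetical and would need tuning per document.

```python
def group_lines_into_paragraphs(lines, max_gap=6.0):
    """Merge text lines into paragraphs based on vertical spacing.

    A new paragraph starts when the distance between the bottom of the
    previous line and the top of the current one exceeds max_gap.
    """
    paragraphs = []
    current = []
    prev_bottom = None
    for line in sorted(lines, key=lambda item: item["top"]):
        if prev_bottom is not None and line["top"] - prev_bottom > max_gap:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line["text"])
        prev_bottom = line["bottom"]
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Usage sketch:
#
#   paragraphs = group_lines_into_paragraphs(page.extract_text_lines())
```

Documents that mark paragraphs by indentation rather than spacing would need a different rule, for example one based on each line's "x0".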
2
votes
2 answers
Extract text and tables of a PDF file in Python
I am looking for a solution to extract both text and tables from a PDF file. While some packages are good at extracting text, they are not good enough at extracting tables.
One solution would be using Azure Form Recognizer Layout Model, but it…

Sam S.
2
votes
1 answer
Conda won't install pdfplumber
I'm trying to use miniconda3 to install pdfplumber. I keep getting this error message and I don't know how to interpret it.
(env1) C:\Users\engineer>conda install -c conda-forge pdfplumber
Collecting package metadata (current_repodata.json):…

PinAppleRedbull
1
vote
1 answer
pdfplumber table-extract inconsistent columns and stripping spaces
Pdfplumber is the most accurate tool I have found so far for extracting text from a PDF, plus it can extract table data in rows and columns. I have encountered two problems with the table function.
a wide column of text (e.g. a description) may be…

PMSK
1
vote
2 answers
pdfplumber python extract_tables setting for the specific strategy
I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I am not able to get specific values from the table. For example, in the first table I have to get [70,75,80,85,90,100,105,110,115,120] and for the…

Herojos
1
vote
1 answer
Python, using pdfplumber, pdfminer packages extract text from pdf, bolded characters duplicates
Goal: extract Chinese financial report text
Implementation: Python pdfplumber/pdfminer package to extract PDF text to txt
Problem: for bold text in the PDF, the corresponding extracted text in the txt file is duplicated
Examples are as follows:
Such as the following…

user19560886
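Duplicated characters for bold text are often explained by "fake bold", where the PDF draws each glyph twice at almost the same position. Whether that applies to this particular report is an assumption. pdfplumber has a built-in `Page.dedupe_chars()` for this case. The sketch below shows the same idea on plain character dicts so the logic is visible; the function name `drop_overlapping_chars` is hypothetical, and the dict keys "text", "x0" and "top" mirror pdfplumber's char objects.

```python
def drop_overlapping_chars(chars, tolerance=1.0):
    """Keep one copy of characters drawn repeatedly at nearly the same spot.

    Two chars count as duplicates when they have the same text and their
    x0 and top coordinates differ by at most `tolerance`. This is a simple
    quadratic scan, fine for a single page but not tuned for speed.
    """
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == other["text"]
            and abs(ch["x0"] - other["x0"]) <= tolerance
            and abs(ch["top"] - other["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# With pdfplumber itself, the built-in equivalent would be along the lines of:
#
#   text = page.dedupe_chars(tolerance=1).extract_text()
```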
1
vote
0 answers
pdfplumber rotates all pages in pdf
I am trying to read a PDF using pdfplumber and it usually works, but if there is one page where the text is rotated, it rotates all other pages as well.
This is the code I use:
import pdfplumber
pdf_path = "some_pdf.pdf"
with…

jsiller
1
vote
1 answer
Regular expressions python - get only the description V2
I am, again, trying to get the description with Python's re module, and I am almost done, but not completely.
I want to extract the description from this list:
list = ['Fatura Original-2ª via',
'Nº Z200 1/8206881085 Data 12-10-2022 Moeda…

foliveir
1
vote
0 answers
How to open a pdf in a specific page via python
As part of my program I was trying to, as the title suggests, open a PDF file in a web browser at a specific page, so reading the contents of that PDF page and printing it via pdfplumber or PyPDF2 won't really do.
I tried searching up methods to do…

imsolost