Questions tagged [pdfplumber]

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

95 questions
5
votes
1 answer

How to extract text from a two-column PDF using PDFPlumber

I am working on topic modeling tasks using Python and would like to extract text from annual/sustainability reports. However, when I tried to extract a report, the extracted lines were broken across the two columns on a page…
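
One common way to approach the two-column problem described here is to crop each page into a left and a right box and extract them in reading order. The sketch below assumes the gutter sits at a fixed fraction of the page width (`split`, a hypothetical parameter that defaults to the middle); real reports may need a different value per page. The helper names are illustrative and do not come from the question. `page.bbox`, `page.crop` and `extract_text` are pdfplumber calls.

```python
def column_bboxes(page_bbox, split=0.5):
    """Split a page box (x0, top, x1, bottom) into left and right column boxes.

    `split` is an assumed gutter position, given as a fraction of the page width.
    """
    x0, top, x1, bottom = page_bbox
    x_mid = x0 + (x1 - x0) * split
    return [(x0, top, x_mid, bottom), (x_mid, top, x1, bottom)]


def extract_two_column_text(pdf_path, split=0.5):
    """Return one string per page, left column first, then right column."""
    import pdfplumber  # imported lazily so column_bboxes can be used without it

    pages_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            parts = []
            for bbox in column_bboxes(page.bbox, split):
                # extract_text() can return None for an empty region
                parts.append(page.crop(bbox).extract_text() or "")
            pages_text.append("\n".join(parts))
    return pages_text
```

Pages that mix full-width headings with two-column body text would need extra handling, which this sketch does not attempt.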
2
votes
1 answer

Trouble parsing interview transcript (Q&As) where questioner name is sometimes redacted

I have the following python script I wrote as a reproducible example of my current pdf-parsing hangup. It: downloads a pdf transcript from the web (Cassidy Hutchinson's 9/14/2022 interview transcript with the J6C) reads/OCRs that pdf to…
Max Power
  • 8,265
  • 13
  • 50
  • 91
2
votes
1 answer

How to solve (cid:x) pdfplumber python text extraction

I've been working with the pdfplumber library to extract text from PDF documents and it's been fine; however, in the documents I'm working on now, I just get spaces and lots of (cid:x) instead of text. Any solution? Thanks with…
foliveir
  • 59
  • 5
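
A `(cid:N)` placeholder typically means the extractor could not map a font's character ID to a Unicode character, so removing the placeholders does not recover the missing text. As a minimal sketch, the helpers below (hypothetical names, standard library only) measure how much of an extracted string is made up of such placeholders, which could be used to decide when to fall back to another approach such as OCR, and strip them for display.

```python
import re

# Matches placeholders such as "(cid:12)"
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Fraction of the string's characters occupied by (cid:N) placeholders."""
    if not text:
        return 0.0
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(text))
    return cid_chars / len(text)


def strip_cids(text):
    """Remove (cid:N) placeholders; the characters they stood for stay lost."""
    return CID_PATTERN.sub("", text)
```

Any threshold applied to `cid_ratio` is a judgment call and would need tuning on the documents in question.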
2
votes
3 answers

Issue with ligatures when converting PDF to text

I am running into an issue when trying to convert a PDF to text where the ligatures 'fi' 'ff' 'fl' are being converted to an empty space. I have read through quite a few similar threads on the issue but have not found a solution that works. This…
Garrett
  • 21
  • 2
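
If the extracted text still contains the Unicode ligature characters (for example U+FB01 for "fi"), they can be expanded with a small translation table, as in the sketch below. This is an assumption about the output: if the extractor has already replaced the ligatures with spaces, as the question describes, there is nothing left to expand and this will not help. The function name is illustrative.

```python
# Latin ligature code points and their plain-letter expansions
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace single-character ligatures with their component letters."""
    return text.translate(_LIGATURE_TABLE)
```

`unicodedata.normalize("NFKC", text)` performs the same expansion but also rewrites other compatibility characters, so the explicit table is the narrower choice.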
2
votes
1 answer

extract borderless table with pdfplumber

I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameter; however, pdfplumber cannot recognize the borderless tables correctly. The PDF file can be downloaded from the link. Here is …
go sgenq
  • 313
  • 3
  • 13
2
votes
1 answer

How to extract texts and tables pdfplumber

With the pdfplumber library, you can extract the text of a PDF page, or you can extract the tables from a pdf page. The issue is that I can't seem to find a way to extract text and tables. Essentially, if the pdf is formatted in this…
Justin Furuness
  • 685
  • 8
  • 21
2
votes
1 answer

Use pdfplumber to extract paragraphs

I'm using pdfplumber to extract text from a pdf. I'm able to extract lines of text, but I'm having trouble extracting a paragraph. Here's the current code I have. Example of text I want to extract: Paragraph Title Lorem ipsum dolor sit amet,…
Solana Liu
  • 45
  • 1
  • 1
  • 6
2
votes
2 answers

Extract text and tables of a PDF file in Python

I am looking for a solution to extract both text and tables out of a PDF file. While some packages are good at extracting text, they are not good enough at extracting tables. One solution would be using Azure Form Recognizer Layout Model, but it…
Sam S.
  • 627
  • 1
  • 7
  • 23
2
votes
1 answer

Conda won't install pdfplumber

I'm trying to use miniconda3 to install pdfplumber. I keep getting this error message and I don't know how to interpret it. (env1) C:\Users\engineer>conda install -c conda-forge pdfplumber Collecting package metadata (current_repodata.json):…
1
vote
1 answer

pdfplumber table-extract inconsistent columns and stripping spaces

Pdfplumber is the most accurate tool I have found so far for extracting text from a PDF, plus it can extract table data in rows and columns. I have encountered two problems with the table function. a wide column of text (e.g. a description) may be…
PMSK
  • 61
  • 4
1
vote
2 answers

pdfplumber python extract_tables setting for the specific strategy

I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I am not able to get specific values from the table; for example, in the first table I have to get [70,75,80,85,90,100,105,110,115,120] and for the…
Herojos
  • 61
  • 4
1
vote
1 answer

Python, using pdfplumber, pdfminer packages extract text from pdf, bolded characters duplicates

Goal: extract Chinese financial report text. Implementation: Python pdfplumber/pdfminer packages to extract PDF text to txt. Problem: for PDF text in bold, the corresponding extracted text in the txt is duplicated. Examples are as follows: Such as the following…
1
vote
0 answers

pdfplumber rotates all pages in pdf

I am trying to read a PDF using pdfplumber and it usually works, but if there is one page where the text is rotated, it rotates all other pages as well. This is the code I use: import pdfplumber pdf_path = "some_pdf.pdf" with…
jsiller
  • 78
  • 7
1
vote
1 answer

Regular expressions python - get only the description V2

I am, again, trying to get the description using Python's re module, and I am almost done, but not entirely. I want to extract the description from this list: list = ['Fatura Original-2ª via', 'Nº Z200 1/8206881085 Data 12-10-2022 Moeda…
foliveir
  • 59
  • 5
1
vote
0 answers

How to open a pdf in a specific page via python

As a part of my program I was trying to, as the title suggests, open a pdf file in a web browser on a specific page so reading the contents of that pdf page and printing it via pdfplumber or PyPDF2 won't really do. I tried searching up methods to do…
imsolost
  • 11
  • 2