Highest Voted 'pdfplumber' Questions

5

votes

1 answer

How to extract text from a two-column PDF using PDFPlumber

I am working on topic modeling tasks using Python and would like to extract text from annual/sustainability reports. My problem is that when I extract a report, the extracted lines are broken across the two columns on a page…

asked Aug 25 '21 at 08:04

Ramachandran Ravishankar

61
4

2

votes

1 answer

Trouble parsing interview transcript (Q&As) where questioner name is sometimes redacted

I wrote the following Python script as a reproducible example of my current PDF-parsing hangup. It: downloads a PDF transcript from the web (Cassidy Hutchinson's 9/14/2022 interview transcript with the J6C); reads/OCRs that PDF to…

python regex pdf nlp pdfplumber

asked Feb 22 '23 at 05:01

Max Power

8,265
13
50
91

2

votes

1 answer

How to solve (cid:x) pdfplumber python text extraction

PDF_Doc I've been working with the pdfplumber library to extract text from PDF documents and it's been fine. However, in the documents I'm working on now, I just get spaces and lots of (cid:x) instead of text. Any solution? Thanks with…

python pypdf pdftotext pdfplumber

asked Nov 12 '22 at 22:03

foliveir

59
5

2

votes

3 answers

Issue with ligatures when converting PDF to text

I am running into an issue when converting a PDF to text: the ligatures 'fi', 'ff', and 'fl' are being converted to an empty space. I have read through quite a few similar threads on the issue but have not found a solution that works. This…

python pdf pdftotext pdfplumber

asked Sep 14 '22 at 19:48

Garrett

21
2

2

votes

1 answer

extract borderless table with pdfplumber

I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameter; however, pdfplumber cannot recognize the borderless tables correctly. The PDF file can be downloaded from the link. Here is …

python python-3.x tabula python-camelot pdfplumber

asked Jul 06 '22 at 15:18

go sgenq

313
3
13

2

votes

1 answer

How to extract texts and tables pdfplumber

With the pdfplumber library, you can extract the text of a PDF page, or you can extract the tables from a pdf page. The issue is that I can't seem to find a way to extract text and tables. Essentially, if the pdf is formatted in this…

python pdf pdfplumber

asked Mar 25 '22 at 04:17

Justin Furuness

685
8
21

2

votes

1 answer

Use pdfplumber to extract paragraphs

I'm using pdfplumber to extract text from a pdf. I'm able to extract lines of text, but I'm having trouble extracting a paragraph. Here's the current code I have. Example of text I want to extract: Paragraph Title Lorem ipsum dolor sit amet,…

python pdf-extraction pdfplumber

asked Feb 15 '22 at 00:28

Solana Liu

45
1
1
6

2

votes

2 answers

Extract text and tables of a PDF file in Python

I am looking for a solution to extract both text and tables from a PDF file. While some packages are good at extracting text, they are not good enough at extracting tables. One solution would be using Azure Form Recognizer Layout Model, but it…

python pdf ocr pypdf pdfplumber

asked Sep 21 '21 at 01:40

Sam S.

627
1
7
23

2

votes

1 answer

Conda wont install pdfplumber

I'm trying to use miniconda3 to install pdfplumber. I keep getting this error message and I don't know how to interpret it. (env1) C:\Users\engineer>conda install -c conda-forge pdfplumber Collecting package metadata (current_repodata.json):…

python conda pdfplumber

asked Aug 05 '21 at 20:42

PinAppleRedbull

81
1
6

1

vote

1 answer

pdfplumber table-extract inconsistent columns and stripping spaces

pdfplumber is the most accurate tool I have found so far for extracting text from a PDF, plus it can extract table data in rows and columns. I have encountered two problems with the table function: a wide column of text (e.g. a description) may be…

python pdfplumber

asked Jul 06 '23 at 13:57

PMSK

61
4

1

vote

2 answers

pdfplumber python extract_tables setting for the specific strategy

I have this PDF and I'm trying to extract a table from it. What is the best strategy to get the table? I am not able to get specific values from the table; for example, in the first table, I have to get [70,75,80,85,90,100,105,110,115,120] and for the…

python pdfplumber

asked Jun 29 '23 at 20:42

Herojos

61
4

1

vote

1 answer

Python, using pdfplumber, pdfminer packages extract text from pdf, bolded characters duplicates

Goal: extract Chinese financial report text. Implementation: Python pdfplumber/pdfminer packages to extract PDF text to txt. Problem: for PDF text in bold, the corresponding extracted text in the txt file is duplicated. Examples are as follows: Such as the following…

pdf nlp extract cjk pdfplumber

asked Apr 10 '23 at 08:17

user19560886

37
2

1

vote

0 answers

pdfplumber rotates all pages in pdf

I am trying to read a PDF using pdfplumber and it usually works, but if there is one page where the text is rotated, it rotates all other pages as well. This is the code I use: import pdfplumber pdf_path = "some_pdf.pdf" with…

python pdfplumber

asked Jan 18 '23 at 16:34

jsiller

78
7

1

vote

1 answer

Regular expressions python - get only the description V2

I am, again, trying to get the description with Python's re module, and I am almost done, but not completely. I want to extract the description from this list: list = ['Fatura Original-2ª via', 'Nº Z200 1/8206881085 Data 12-10-2022 Moeda…

python regex regex-lookarounds pdfplumber

asked Oct 15 '22 at 17:00

foliveir

59
5

1

vote

0 answers

How to open a pdf in a specific page via python

As part of my program I was trying to, as the title suggests, open a PDF file in a web browser at a specific page, so reading the contents of that PDF page and printing it via pdfplumber or PyPDF2 won't really do. I tried searching for methods to do…

python pdf pypdf pdfplumber

asked Oct 09 '22 at 07:22

imsolost

11
2
