Use pdfplumber to extract paragraphs

Question

I'm using pdfplumber to extract text from a PDF. I can extract individual lines of text, but I'm having trouble extracting a whole paragraph. My current code is below.

Example of text I want to extract:

Paragraph Title

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque.

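import pdfplumber

with pdfplumber.open(path_to_pdf) as pdf:
    pageno = 1  # zero-based index, so this is the second page
    page = pdf.pages[pageno]
    text = page.extract_text(x_tolerance=5)

# extract_text() returns one string; split it into individual lines
lines = text.split("\n")
lines = [x.lower().strip() for x in lines]
print(lines)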

How can I alter this to extract paragraphs instead? Right now it adds each line to a list as a separate element, which gives me this: ['Paragraph Title', 'lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et', 'dolore magna aliqua. enim facilisis gravida neque convallis a cras semper auctor neque.']

I want it to give me this instead, with the paragraph title as one element and the whole paragraph as the next: ['Paragraph Title', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque. ']

Does `y_tolerance=7` help? – Cristian F. Sep 25 '22 at 14:04

score 0 · Answer 1 · answered Feb 25 '23 at 10:39

As far as I can determine, PDF text extraction is just rubbish. You just get lines of text, no paragraphs or columns. There may be great functions for tables, as there are for docx tables, but nothing for run-of-the-mill simple data extraction in the original paragraph layout.
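That said, since the lines do come with positions, a rough workaround may be to merge them yourself based on the vertical gap between them. The following is only a minimal sketch, not a built-in pdfplumber feature: `group_lines_into_paragraphs`, `extract_paragraphs` and the `gap_ratio` threshold are hypothetical names chosen for illustration, and it assumes that paragraphs (and the title) are separated by more vertical space than the lines inside a paragraph. It also assumes a pdfplumber version recent enough to provide `page.extract_text_lines()`, which returns one dict per line with `text`, `top` and `bottom` keys.

```python
def group_lines_into_paragraphs(lines, gap_ratio=0.5):
    """Merge text lines into paragraphs based on the vertical gap between them.

    `lines` is a list of dicts with "text", "top" and "bottom" keys.
    A new paragraph starts when the gap to the previous line is larger than
    `gap_ratio` times the previous line's height (a heuristic, tune per document).
    """
    paragraphs = []
    prev = None
    for line in sorted(lines, key=lambda item: item["top"]):
        text = line["text"].strip()
        if not text:
            continue  # ignore empty lines
        if prev is None:
            paragraphs.append(text)
        else:
            gap = line["top"] - prev["bottom"]
            height = prev["bottom"] - prev["top"]
            if gap > gap_ratio * height:
                paragraphs.append(text)          # large gap: start a new paragraph
            else:
                paragraphs[-1] += " " + text     # small gap: continue the paragraph
        prev = line
    return paragraphs


def extract_paragraphs(path_to_pdf, page_index=0, gap_ratio=0.5):
    """Hypothetical helper: read one page and group its lines into paragraphs."""
    import pdfplumber  # imported here so the grouping helper works without it

    with pdfplumber.open(path_to_pdf) as pdf:
        page = pdf.pages[page_index]
        lines = page.extract_text_lines()  # assumed available in recent versions
    return group_lines_into_paragraphs(lines, gap_ratio)
```

Calling `extract_paragraphs(path_to_pdf, page_index=1)` would then return a list with one string per paragraph. This will not cope with multi-column layouts, and `gap_ratio` will likely need adjusting for each document.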

Isaac

This type of commentary is better provided as a comment on the question rather than as an answer. – sloppypasta Mar 01 '23 at 17:51
