
I'm using pdfplumber to extract text from a PDF. I'm able to extract individual lines of text, but I'm having trouble extracting a whole paragraph. My current code is below.

Example of text I want to extract:

Paragraph Title

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque.

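import pdfplumber

with pdfplumber.open(path_to_pdf) as pdf:
    pageno = 1  # zero-based index, so this is the second page
    page = pdf.pages[pageno]
    text = page.extract_text(x_tolerance=5)

# extract_text() returns a single string, so split it into lines first
lines = text.split("\n")
lines = [x.lower().strip() for x in lines]
print(lines)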

How can I alter this to extract paragraphs instead? Right now it gives me the following, with each line added to the list as a separate element: ['Paragraph Title', 'lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et', 'dolore magna aliqua. enim facilisis gravida neque convallis a cras semper auctor neque.']

I want it to give me this instead, with the paragraph title as one element and the whole paragraph as the next: ['Paragraph Title', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque. ']

Solana Liu

1 Answer


As far as I can determine, PDF text extraction is just rubbish. You only get lines of text, with no paragraphs or columns. There may be great functions for tables, as there are for docx tables, but there is nothing for run-of-the-mill extraction of simple text in its original paragraph layout.
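That said, a rough workaround is to rebuild paragraphs yourself from line positions. The sketch below is only a heuristic, not a pdfplumber feature: it assumes each line is a dict with "text", "top" and "bottom" keys (vertical coordinates in page units, sorted top to bottom), and it starts a new paragraph whenever the vertical gap to the previous line is large relative to that line's height. The function name `group_lines_into_paragraphs`, the `gap_factor` threshold, and the sample coordinates are all made up for illustration.

```python
def group_lines_into_paragraphs(lines, gap_factor=0.5):
    """Merge consecutive lines into paragraphs using vertical gaps.

    `lines` is a list of dicts with "text", "top" and "bottom" keys,
    ordered from the top of the page to the bottom. A new paragraph
    starts when the gap to the previous line exceeds
    gap_factor * (height of the previous line).
    """
    paragraphs = []
    current = []  # text of the lines in the paragraph being built
    prev = None
    for line in lines:
        if prev is not None:
            gap = line["top"] - prev["bottom"]
            height = prev["bottom"] - prev["top"]
            if gap > gap_factor * height:
                # Large gap: close the current paragraph.
                paragraphs.append(" ".join(current))
                current = []
        current.append(line["text"].strip())
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical coordinates, chosen only to illustrate the idea:
# a 12-unit gap after the title, a 2-unit gap inside the paragraph.
sample_lines = [
    {"text": "Paragraph Title", "top": 72.0, "bottom": 84.0},
    {"text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
             "sed do eiusmod tempor incididunt ut labore et",
     "top": 96.0, "bottom": 108.0},
    {"text": "dolore magna aliqua. Enim facilisis gravida neque "
             "convallis a cras semper auctor neque.",
     "top": 110.0, "bottom": 122.0},
]

paragraphs = group_lines_into_paragraphs(sample_lines)
print(paragraphs)
```

With pdfplumber, the input could probably come from page.extract_text_lines(), which as far as I know returns dicts that include these keys; treat that as an assumption and check it against your installed version. The threshold will need tuning per document, and the heuristic breaks down when paragraphs are separated only by indentation rather than extra spacing.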

Isaac