I'm using pdfplumber to extract text from a PDF. I can extract individual lines of text, but I'm having trouble extracting a whole paragraph. Below are an example of the text I want to extract and my current code.
Example of the text I want to extract:
Paragraph Title
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque.
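import pdfplumber

with pdfplumber.open(path_to_pdf) as pdf:
    pageno = 1  # pdf.pages is zero-indexed, so this is the second page
    page = pdf.pages[pageno]
    text = page.extract_text(x_tolerance=5)
    # Split the page text into lines, then normalize each line
    lines = [x.lower().strip() for x in text.split('\n')]
    print(lines)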
How can I alter this to extract paragraphs instead? Right now it adds each line to the list as a separate element, so I get this:
['paragraph title', 'lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et', 'dolore magna aliqua. enim facilisis gravida neque convallis a cras semper auctor neque.']
I want it to give me this instead, with the paragraph title as one element and the whole paragraph as the next:
['Paragraph Title', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque. ']
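For illustration, here is a minimal sketch of one way the extracted lines might be merged into paragraphs. It is a plain-string heuristic, not a pdfplumber feature: the helper name `group_paragraphs` and the `max_title_len` threshold are hypothetical, and it assumes a title is a short line that does not end in punctuation. A short final line of a paragraph that lacks closing punctuation would be misread as a title, so the rule may need tuning for a real document.

```python
def group_paragraphs(lines, max_title_len=40):
    """Merge consecutive body lines into one string; keep title-like lines separate."""
    paragraphs = []
    current = []  # body lines of the paragraph currently being built
    for line in lines:
        line = line.strip()
        # Assumption: a title is short and does not end with punctuation.
        is_title = (len(line) <= max_title_len
                    and not line.endswith(('.', ',', ';', ':', '?', '!')))
        if not line or is_title:
            # A blank or title line closes the paragraph in progress.
            if current:
                paragraphs.append(' '.join(current))
                current = []
            if line:
                paragraphs.append(line)
        else:
            current.append(line)
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


example_lines = [
    'Paragraph Title',
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et',
    'dolore magna aliqua. Enim facilisis gravida neque convallis a cras semper auctor neque.',
]
print(group_paragraphs(example_lines))
```

To use it with the code above, the list built from `text.split('\n')` would be passed in before lowercasing, since the desired output keeps the original capitalization.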