How can I extract only the text, without tables, from a PDF file using pdfplumber?

Question

I want to process some PDF files with an NLP module, so I first want to strip all existing tables out of those files.

This is the code for extracting tables using pdfplumber:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[1]  # pages is zero-indexed, so this is the second page
table = page.extract_table()

But I want to do the inverse and extract only the text.

Hi @medensa, I also need the answer for same problem. Could you please share what you did eventually? — suptagni, Apr 28 '22 at 04:42

score 0 · Answer 1 · answered Feb 22 '21 at 20:42

Disclaimer: I am the author of pText, the library used in this answer.

Load the Document
Define a LocationFilter

A LocationFilter does pretty much what it says on the tin. It will listen to parsing events (like "render text" or "change font to") but it will only allow those to come through that fall within a given boundary.

Keep in mind that the origin in PDF coordinates is at the lower-left corner. The LocationFilter in this example will therefore match only text in the lower-left corner of the page.
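To make the boundary check concrete, here is a minimal pure-Python sketch of the idea. It is not pText's implementation: the `Box` type, the `contains` helper, the sample events, and the reading of the four numbers as (x0, y0, x1, y1) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical rectangle in PDF coordinates (origin at the lower left)."""
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def contains(box: Box, x: float, y: float) -> bool:
    """Return True if the point (x, y) falls inside the box."""
    return box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1


# Assumed to mirror LocationFilter(0, 0, 100, 100): a box in the lower-left corner.
box = Box(0, 0, 100, 100)

# Made-up "render text" events as (text, x, y); y grows upwards from the bottom.
events = [("footer", 20, 30), ("title", 50, 750)]

# Only events whose position lies inside the box are passed on.
kept = [text for text, x, y in events if contains(box, x, y)]
print(kept)
```

Because y grows upwards from the bottom of the page, the event near the top of the page (y = 750) is dropped and only the one near the bottom is kept.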

Add a SimpleTextExtraction to the LocationFilter

The next question is "what is the LocationFilter going to pass events to?" In this case, you can start by trying a SimpleTextExtraction.

Putting it all together:

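# only let events through that fall inside this box (the lower-left corner)
l0 = LocationFilter(0, 0, 100, 100)

# collect the text of the events that pass the filter
l1 = SimpleTextExtraction()
l0.add_listener(l1)

# pass the filter (l0) as the listener, so l1 only sees filtered events
doc = PDF.loads(pdf_file_handle, [l0])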

After the Document has loaded, you can ask the SimpleTextExtraction for all the text on a given Page.

l1.get_text(0)

You can obtain pText either on GitHub or from PyPI. There are a ton more examples; check them out to find out more about working with images.
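If you would rather stay with pdfplumber, as the question asks, the same bounding-box idea can be applied there. The following is only a sketch, assuming that `page.find_tables()` detects every table on the page and that each returned table exposes a `bbox` of the form (x0, top, x1, bottom). The helper names `outside_all_bboxes` and `extract_text_without_tables` are made up for this example. Note that pdfplumber measures `top` and `bottom` from the top of the page, unlike the raw PDF coordinates described above.

```python
def outside_all_bboxes(obj, bboxes):
    """Return True if the centre of a page object lies outside every bbox.

    obj is a pdfplumber-style object dict with x0, x1, top and bottom keys;
    each bbox is an (x0, top, x1, bottom) tuple.
    """
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return not any(
        x0 <= h_mid < x1 and top <= v_mid < bottom
        for (x0, top, x1, bottom) in bboxes
    )


def extract_text_without_tables(path, page_index=0):
    """Extract the text of one page, skipping objects inside detected tables."""
    # Imported here so the helper above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        # Bounding boxes of the tables pdfplumber finds on this page.
        bboxes = [table.bbox for table in page.find_tables()]
        # Keep only the objects whose centre is outside all table boxes.
        filtered = page.filter(lambda obj: outside_all_bboxes(obj, bboxes))
        return filtered.extract_text()
```

Calling `extract_text_without_tables("file.pdf", page_index=1)` would then process the same page as in the question. How well this works depends on pdfplumber's table detection, so tables without ruling lines may need custom table settings.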

score -2 · Answer 2 · answered Feb 22 '21 at 12:21

Do you really have to stick with pdfplumber? If not, I can suggest a better solution: use tabula instead. Here is an answer to a similar question you can check out: tabula

elmurod1202

Answers should be more than just a link to an external site. At least show how you'd solve this man's problem using tabula. – Joris Schellekens Feb 22 '21 at 20:34
