2

I want to process some PDF files with an NLP module, and I want to remove all existing tables from those files.

This is the code for extracting tables using pdfplumber:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[1]  # pages are zero-indexed, so this is the second page
table = page.extract_table()

But I want to invert the operation and extract only the text.

medensa
  • Hi @medensa, I also need the answer for same problem. Could you please share what you did eventually? – suptagni Apr 28 '22 at 04:42

2 Answers

0

Disclaimer: I am the author of pText, the library used in this answer.

  1. Load the Document

  2. Define a LocationFilter

A LocationFilter does pretty much what it says on the tin. It will listen to parsing events (like "render text" or "change font to") but it will only allow those to come through that fall within a given boundary.

Keep in mind that the origin in PDF coordinates is at the lower-left corner. The LocationFilter in this example will therefore match only text in the lower-left corner of the page.
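Since pdfplumber reports positions measured from the top of the page, a box taken from it has to be flipped vertically before it matches the lower-left origin described here. A minimal sketch of that conversion follows; the function name `top_left_to_pdf_bbox` is hypothetical, and it assumes the box is given as `(x0, top, x1, bottom)` in points.

```python
def top_left_to_pdf_bbox(bbox, page_height):
    """Convert (x0, top, x1, bottom), measured from the top-left corner,
    into (x0, y0, x1, y1), measured from the lower-left corner."""
    x0, top, x1, bottom = bbox
    # The bottom edge becomes the lower y value, the top edge the upper one.
    return (x0, page_height - bottom, x1, page_height - top)


# A 100 x 100 box touching the bottom-left of a 792-point-high page
print(top_left_to_pdf_bbox((0, 692, 100, 792), 792))  # (0, 0, 100, 100)
```

The horizontal coordinates are unchanged; only the vertical axis is mirrored around the page height.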

  3. Add a SimpleTextExtraction to the LocationFilter

The next question is "what is the LocationFilter going to pass events to?" In this case, you can start by trying a SimpleTextExtraction.

Putting it all together:

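# Only pass on events that fall within this boundary (lower-left corner of the page)
l0 = LocationFilter(0, 0, 100, 100)

# Collect the text from the events the filter lets through
l1 = SimpleTextExtraction()
l0.add_listener(l1)

# pdf_file_handle is the already-opened PDF file
doc = PDF.loads(pdf_file_handle, [l0])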

After the Document has loaded, you can ask the SimpleTextExtraction for all the text on a given Page.

l1.get_text(0)

You can obtain pText either on GitHub or from PyPI. There are many more examples; check them out to find out more about working with images.

Joris Schellekens
-2

Do you really have to stick with pdfplumber? If not, I can suggest a better solution: use tabula instead. Here is an answer to a similar question you can check out: tabula
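If pdfplumber has to stay, one possible approach is to locate the tables first and then drop every page object that falls inside one of them before extracting text. The sketch below is only an illustration under a few assumptions: the helper names `is_outside_all` and `extract_text_without_tables` are made up for this example, and testing an object's centre point against each table box is a simplification that may misjudge text straddling a table border. It relies on pdfplumber's `find_tables()`, the `bbox` attribute of the tables it returns, `Page.filter()` and `extract_text()`.

```python
def is_outside_all(obj, bboxes):
    """Return True if the object's centre lies outside every bounding box.

    Each bbox is (x0, top, x1, bottom), the layout pdfplumber uses.
    """
    # Objects without position data cannot be tested, so keep them.
    if not all(key in obj for key in ("x0", "x1", "top", "bottom")):
        return True
    cx = (obj["x0"] + obj["x1"]) / 2
    cy = (obj["top"] + obj["bottom"]) / 2
    return not any(
        x0 <= cx <= x1 and top <= cy <= bottom
        for x0, top, x1, bottom in bboxes
    )


def extract_text_without_tables(path, page_index=0):
    """Extract the text of one page, skipping anything inside a detected table."""
    # Imported here so the helper above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        bboxes = [table.bbox for table in page.find_tables()]
        filtered = page.filter(lambda obj: is_outside_all(obj, bboxes))
        return filtered.extract_text()
```

For the file in the question this might be called as `extract_text_without_tables("file.pdf", page_index=1)`. How well it works depends on how reliably pdfplumber detects the tables in a given document.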