How to extract only the main text with pdfplumber and ignore image text and tables?

Question

I am trying to parse any non-scanned PDF and extract only the text, without tables and their captions or pictures and their captions: just the main text of the PDF, if such text exists. I tried pdfplumber.

This piece of code extracts all the text, including tables and their captions:

import pdfplumber

with pdfplumber.open("somePDFname.pdf") as pdf:
  for pdf_page in pdf.pages:
    single_page_text = pdf_page.extract_text()
    print( single_page_text )

I saw this solution, "How to ignore table and its content while extracting text from pdf", but if I understood correctly it was specific to one particular table, so it did not work for me, as I don't know the dimensions of the tables/images in the PDFs I'm parsing.

I also read the related issue in the pdfplumber repository (https://github.com/jsvine/pdfplumber/issues/242).

I also saw this solution, https://stackoverflow.com/questions/66293939/how-i-can-extract-only-text-without-tables-inside-a-pdf-file-using-pdfplumber, but I would rather use pdfplumber, since I need it for later parsing.

Is there a more general solution to the problem?

jainam shah · Answer 1 · 2022-11-17T11:37:39.593

1

Hello, you can use a filter when extracting the text:

clean_text = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
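In pdfplumber, `.filter` is a method of the page rather than of the extracted string, so the filter has to be applied to the page before calling `extract_text()`. A minimal sketch of that idea follows. The helper names `make_char_filter` and `extract_filtered_text` are hypothetical, and the choice of "Bold" or of any size range is only an assumption about what distinguishes the text you want in a given PDF.

```python
def make_char_filter(font_substring=None, min_size=None, max_size=None):
    """Build a predicate for page.filter() (hypothetical helper).

    Keeps only character objects whose font name contains `font_substring`
    and whose size lies within [min_size, max_size]; a criterion left as
    None is not checked.
    """
    def keep(obj):
        if obj.get("object_type") != "char":
            return False
        if font_substring is not None and font_substring not in obj.get("fontname", ""):
            return False
        size = obj.get("size", 0)
        if min_size is not None and size < min_size:
            return False
        if max_size is not None and size > max_size:
            return False
        return True
    return keep


def extract_filtered_text(pdf_path, keep):
    """Return one string per page, built only from objects accepted by `keep`."""
    import pdfplumber  # imported here so the predicate can be tried without a PDF

    pages_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            filtered_page = page.filter(keep)  # filter the page, not the text
            pages_text.append(filtered_page.extract_text() or "")
    return pages_text
```

For example, `extract_filtered_text("somePDFname.pdf", make_char_filter(font_substring="Bold"))` would keep only characters set in a bold font, and `make_char_filter(min_size=9, max_size=12)` would keep characters in that size range (the numbers are placeholders, not recommended values).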

You can also specify the font size in the filter. To see which attributes are available, inspect a character:

import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
   first_page = pdf.pages[0]
   print(first_page.chars[0])

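The code above prints the details of the first character on the page (font name, size, position), so you can check these values page by page and decide what to filter on.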
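For the more general case in the question, where the table and image dimensions are not known in advance, one possible approach is to let pdfplumber detect them and drop every object whose midpoint falls inside a detected bounding box. This is only a sketch: `bbox_contains_midpoint`, `outside_all` and `extract_main_text` are hypothetical helper names, and it assumes that `page.find_tables()` actually detects the tables in your PDFs, which depends on how they are drawn.

```python
def bbox_contains_midpoint(bbox, obj):
    """True if the midpoint of `obj` lies inside bbox = (x0, top, x1, bottom)."""
    if not all(key in obj for key in ("x0", "x1", "top", "bottom")):
        return False  # object without coordinates: treat it as outside
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def outside_all(bboxes):
    """Build a predicate for page.filter() that drops objects inside any bbox."""
    def keep(obj):
        return not any(bbox_contains_midpoint(bbox, obj) for bbox in bboxes)
    return keep


def extract_main_text(pdf_path):
    """Return one string per page, skipping detected tables and images."""
    import pdfplumber  # imported here so the helpers can be tried without a PDF

    pages_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Bounding boxes of the tables pdfplumber detects on this page
            bboxes = [table.bbox for table in page.find_tables()]
            # Bounding boxes of embedded images
            bboxes += [(im["x0"], im["top"], im["x1"], im["bottom"])
                       for im in page.images]
            filtered_page = page.filter(outside_all(bboxes))
            pages_text.append(filtered_page.extract_text() or "")
    return pages_text
```

Captions usually sit just outside the table or image box, so they would survive this filter. Combining it with a font-size or font-name check on the characters may help, but that depends on how the particular document is typeset.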

edited Nov 17 '22 at 11:37

answered Nov 17 '22 at 04:32


Thanks, but this doesn't work for me: in pdfplumber, `.filter` is a method of the page object, not of the text. When I try `print(pdf_page.filter(lambda obj: obj["object_type"] == "char"))` I get ``. Could you please clarify? Thank you! – learningtocode Nov 17 '22 at 11:11
Please check the newly added code for reference. You can get the page-wise and character-wise details. – jainam shah Nov 17 '22 at 11:35
