
I am trying to parse any non-scanned PDF and extract only its text, without tables and their comments or pictures and their comments. I want just the main text of the PDF, if such text exists. I tried pdfplumber.

When I try this piece of code, it extracts all the text, including tables and their comments.

import pdfplumber

with pdfplumber.open("somePDFname.pdf") as pdf:
  for pdf_page in pdf.pages:
    single_page_text = pdf_page.extract_text()
    print( single_page_text )

I saw this solution, "How to ignore table and its content while extracting text from pdf", but if I understood correctly it was specific to a certain table, so it did not work for me, as I don't know the dimensions of the tables/images I'm parsing.

I also read the related issue in the pdfplumber repository (https://github.com/jsvine/pdfplumber/issues/242).

I also saw this solution, https://stackoverflow.com/questions/66293939/how-i-can-extract-only-text-without-tables-inside-a-pdf-file-using-pdfplumber, but I would rather use pdfplumber for later parsing.

Is there a more general solution to the problem?

1 Answer

Hello, you can use a filter when extracting the text:

clean_text = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"])
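In pdfplumber, `filter` is a method of the page object rather than of the extracted string, so a working variant of the line above might look like the following sketch. The names `is_bold_char` and `extract_bold_text` are made up for illustration, and whether the text you want really uses a font whose name contains "Bold" depends entirely on the PDF.

```python
def is_bold_char(obj):
    """Return True for character objects whose font name contains 'Bold'."""
    return obj.get("object_type") == "char" and "Bold" in obj.get("fontname", "")


def extract_bold_text(path):
    """Return a list with the filtered text of each page (hypothetical helper)."""
    # Imported here so that is_bold_char can be used without pdfplumber installed.
    import pdfplumber

    pages_text = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page.filter returns a filtered copy of the page that contains
            # only the objects for which the test function returns True.
            filtered_page = page.filter(is_bold_char)
            pages_text.append(filtered_page.extract_text())
    return pages_text
```

Calling `extract_bold_text("path/to/file.pdf")` would then give one string per page; swapping the condition in `is_bold_char` lets you filter on other character attributes instead.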

You can also specify the font size in the filter:

import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
   first_page = pdf.pages[0]
   print(first_page.chars[0])

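Please check the code above: it prints the first character object on the first page, so you can inspect the page-wise, character-wise details (such as the font name and size) to filter on.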
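For the more general case in the question, where the table dimensions are not known in advance, one possible approach is to let pdfplumber detect the tables itself and drop every character that falls inside a detected table's bounding box. This is only a sketch: `inside_any_bbox` and `text_outside_tables` are hypothetical helper names, the result depends on how well `find_tables()` detects tables in your PDF, and table comments or picture captions that sit outside the detected boxes are not removed.

```python
def inside_any_bbox(obj, bboxes):
    """Return True if the centre of obj lies inside any (x0, top, x1, bottom) box."""
    center_x = (obj["x0"] + obj["x1"]) / 2
    center_y = (obj["top"] + obj["bottom"]) / 2
    return any(
        x0 <= center_x <= x1 and top <= center_y <= bottom
        for (x0, top, x1, bottom) in bboxes
    )


def text_outside_tables(page):
    """Extract the text of a pdfplumber page, skipping characters inside tables."""
    # find_tables() returns table objects; each one exposes its bounding box.
    table_bboxes = [table.bbox for table in page.find_tables()]

    def keep(obj):
        # Only characters are tested; other object types are kept unchanged.
        if obj.get("object_type") != "char":
            return True
        return not inside_any_bbox(obj, table_bboxes)

    return page.filter(keep).extract_text()
```

Inside the loop from the question you could then call `text_outside_tables(pdf_page)` instead of `pdf_page.extract_text()`. If the default detection misses tables without ruling lines, passing different table settings to `find_tables()` may help, at the cost of tuning per document type.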

jainam shah
  • Thanks, but it doesn't work for me. In pdfplumber, the .filter function is only available on a page object, not on text. When trying `print(pdf_page.filter(lambda obj: obj["object_type"] == "char"))` I get ``. Could you please clarify? Thank you! – learningtocode Nov 17 '22 at 11:11
  • Please check the newly added code for reference. You can get the page-wise and character-wise details. – jainam shah Nov 17 '22 at 11:35