trying to parse any non scanned pdf and extract only text, without tables and their comments or pictures and their comment. just the main text of a pdf, if such text exists. tried pdfplumber.
when trying this piece of code it extract all texts, include tables and their comments.
import pdfplumber
with pdfplumber.open("somePDFname.pdf") as pdf:
for pdf_page in pdf.pages:
single_page_text = pdf_page.extract_text()
print( single_page_text )
saw this solution - How to ignore table and its content while extracting text from pdf but if I understood correctly it was specific for a certain table, so did not work for me as I don't know the dim of the tables/images I'm scanning.
also read the issue in the pdfplumber (https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj0zejJ2P76AhUzuZUCHZ3oBZkQFnoECBAQAQ&url=https%3A%2F%2Fgithub.com%2Fjsvine%2Fpdfplumber%2Fissues%2F242&usg=AOvVaw3-4BI2LYY2dmH9ldel9_J9).
saw this solution also -https://stackoverflow.com/questions/66293939/how-i-can-extract-only-text-without-tables-inside-a-pdf-file-using-pdfplumber but rather use pdfplumber for later parsing.
Is there a more general solution to the problem?