
I am trying to extract text from PDF files using PyTesseract and some other Python libraries. I want to ignore all tables, charts, and images in my files and extract only the text (paragraphs, sentences, etc.). I haven't found a way to do this yet. Does anyone know how to do it? Thanks in advance.

  • Please elaborate on what you've already tried to solve your issue. Ideally, insert a code snippet of your attempt so StackOverflow users can take a look at it and help you out. – J. M. Arnold Jan 01 '21 at 12:34
  • Maybe you should use a module that works with PDFs directly, without OCR. Of course, even a simple PDF may have a complex structure, and working with PDFs can be hell, but working with OCR is not easy either. – furas Jan 01 '21 at 16:46
  • I think this ignores images and tables: https://stackoverflow.com/questions/39854841/pdfminer-python-3-5/40877143#40877143. You might also give "camelot" a try and search for examples here on stackoverflow: https://github.com/atlanhq/camelot – pyano Jan 22 '21 at 10:27
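
The suggestion in the comments above (read the PDF's text layer directly instead of running OCR) could be sketched roughly as follows. This assumes the PDFs contain real text rather than scanned page images, and that `pdfminer.six` is installed. The function names `looks_like_prose` and `extract_prose`, as well as the thresholds, are hypothetical and only for illustration. Note that pdfminer skips figures and images when only text containers are kept, but it does not recognise tables as such: table cells arrive as ordinary text, so the filter below is a rough heuristic, not a library feature.

```python
import re
from typing import List


def looks_like_prose(block: str, min_words: int = 8, min_alpha_ratio: float = 0.7) -> bool:
    """Heuristic (an assumption, not part of pdfminer): a block counts as prose
    when it has enough words and consists mostly of letters. Table cells and
    chart labels tend to be short or dominated by digits and symbols."""
    words = block.split()
    if len(words) < min_words:
        return False
    chars = [c for c in block if not c.isspace()]
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_ratio


def extract_prose(pdf_path: str) -> List[str]:
    """Return the text blocks of a PDF that pass the prose heuristic."""
    # Imported lazily so the heuristic can be tried without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    paragraphs = []
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            # Figures, images, lines, and rectangles are not text containers,
            # so they are skipped here.
            if isinstance(element, LTTextContainer):
                text = element.get_text().strip()
                if text and looks_like_prose(text):
                    # Collapse line breaks inside the block into single spaces.
                    paragraphs.append(re.sub(r"\s+", " ", text))
    return paragraphs
```

Calling `extract_prose("report.pdf")` (the file name is a placeholder) would return a list of paragraph-like strings. The thresholds will likely need tuning per document, and for scanned PDFs an OCR step is still required before any filtering like this can apply.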

0 Answers