I am extracting text from PDF files using the Python pdfminer library (see docs).
However, for some files pdfminer does not extract all of the text and returns an LTFigure
object instead. Judging by the position of this object, it "covers" some of the text, which would explain why that text is not extracted.
Both the PDF file and a short Jupyter notebook with the extraction code are in a GitHub repository I created specifically for this question:
https://github.com/druskacik/ltfigure-pdfminer
I am not an expert on how PDF files work, but common sense tells me that if I can search for the text with Ctrl+F
in a browser, it should be extractable.
I have considered using another library, but I also need the positions of the extracted words (to use them in my machine learning model), and pdfminer seems to be the only library that provides them.
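
To make the goal concrete, here is a minimal sketch of the kind of extraction I am after: text together with its bounding box, including text nested inside containers such as LTFigure. This is not the exact code from the notebook. It assumes the pdfminer.six fork (for `extract_pages` and `LAParams(all_texts=True)`, which as far as I understand asks the layout analysis to also process text inside figures), and the helper names `iter_text_items` and `extract_text_with_positions` are made up for illustration.

```python
from collections.abc import Iterable


def iter_text_items(layout_obj):
    """Yield (text, bbox) pairs found under a layout object.

    An object that exposes both get_text() and bbox is treated as a text
    item and is not descended into. Any other iterable object (for example
    a page or an LTFigure) is treated as a container and walked recursively.
    """
    if hasattr(layout_obj, "get_text") and hasattr(layout_obj, "bbox"):
        yield layout_obj.get_text(), tuple(layout_obj.bbox)
    elif isinstance(layout_obj, Iterable):
        for child in layout_obj:
            yield from iter_text_items(child)


def extract_text_with_positions(pdf_path):
    """Yield (page_number, text, bbox) for every text item in the PDF.

    Assumption: pdfminer.six is installed. The import is done lazily so
    that iter_text_items can be used (and tested) without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    # all_texts=True is meant to run layout analysis inside figures as well.
    laparams = LAParams(all_texts=True)
    pages = extract_pages(pdf_path, laparams=laparams)
    for page_number, page in enumerate(pages, start=1):
        for text, bbox in iter_text_items(page):
            yield page_number, text, bbox
```

Calling `list(extract_text_with_positions("some_file.pdf"))` (the file name is a placeholder) should then give one tuple per text item, with `bbox` as `(x0, y0, x1, y1)` in PDF coordinates. Whether this actually recovers the text hidden behind the LTFigure in my file is exactly what I am unsure about.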