I know that
pdftotext -f 42 -l 42 -layout mypdf.pdf
gives me the extracted content of page 42 from mypdf.pdf, formatted with the "correct" layout. But I have a page designed with two columns, where the lines of the two columns do not line up. Apparently, pdftotext simply drops some of the content.
Is it possible to give it the coordinates of a box within which it should extract the text / layout?
If this is not possible within pdftotext, a Python solution is also acceptable.
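
To make the request concrete, here is a minimal Python sketch of the kind of call I am after. It assumes the poppler build of pdftotext, whose man page lists -x, -y, -W and -H options for a crop area (top-left corner plus width and height, in pixels at the -r resolution, which defaults to 72 DPI). The helper names and the box values are hypothetical, and I have not confirmed that cropping to one column avoids the dropped content.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page, x, y, width, height):
    """Build a pdftotext command limited to one page and one crop box.

    Assumption: poppler's pdftotext, where -x/-y give the top-left corner
    of the crop area and -W/-H give its width and height.
    """
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),   # first and last page to convert
        "-x", str(x), "-y", str(y),         # top-left corner of the box
        "-W", str(width), "-H", str(height),
        "-layout",
        pdf_path,
        "-",                                # write the text to stdout
    ]


def extract_box(pdf_path, page, x, y, width, height):
    """Run pdftotext on the box and return the extracted text."""
    cmd = build_pdftotext_cmd(pdf_path, page, x, y, width, height)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

With something like this, each column could be extracted separately, for example extract_box("mypdf.pdf", 42, 0, 0, 300, 800) for the left column, with the box values adjusted to the actual page geometry.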