I know that

    pdftotext -f 42 -l 42 -layout mypdf.pdf

gives me the extracted content of page 42 of mypdf.pdf, formatted with the "correct" layout. But I have a page with a two-column design where the lines in the two columns do not line up. Apparently, pdftotext simply drops some of the content.

Is it possible to give it the coordinates of a box within which it should extract the text / layout?

If this is not possible with pdftotext, a Python solution is also acceptable.

Martin Thoma
  • You could use Ghostscript to divide the pages into two and then extract them, or use [Briss](https://sourceforge.net/projects/briss/). – Ulug Toprak Aug 16 '17 at 09:16
  • @UlugToprak This sounds as if it would be a lot of work. I would need to be able to recognize the position of each element first and then be able to create valid PDFs from that. – Martin Thoma Aug 16 '17 at 09:20
  • Briss is quick to install and easy to use, so I suppose it is worth trying. Ghostscript would be a bit more tricky and time-consuming, especially if the column margins differ from page to page. Also, Briss doesn't actually crop but rather masks the parts you select, while Ghostscript actually splits them, ready to use with `pdftotext`. – Ulug Toprak Aug 16 '17 at 09:23
  • @UlugToprak Looks as if Briss could be worth a try. Do you know if there is something similar for Python? – Martin Thoma Aug 16 '17 at 10:03
  • Oh, I might just have found it: PyPDF seems to be able to do so: https://stackoverflow.com/a/465901/562769 – Martin Thoma Aug 16 '17 at 10:04

1 Answer

Recent versions of pdftotext should do what you want: the `-x`, `-y`, `-W` and `-H` options restrict extraction to a rectangular crop area.

Example:

    pdftotext -x 100 -y 100 -W 20 -H 20 your-file.pdf -

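This should give you the text within a 20x20 box whose top-left corner is at x = 100, y = 100 (y goes from top to bottom). The values are in pixels at the extraction resolution, which defaults to 72 DPI, so one unit equals one PDF point unless you change it with `-r`. The trailing `-` writes the text to stdout instead of a file.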
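Since the question also allows a Python solution, here is a minimal sketch that wraps the same command with `subprocess` and extracts the two columns one after the other. The helper names, the 612x792 page size (US Letter in points) and the assumption that the gutter sits at the horizontal midpoint are illustrative choices, not something taken from the question; adjust them to the actual page.

```python
import subprocess


def build_crop_command(pdf_path, page, x, y, width, height, layout=True):
    """Build the pdftotext argument list for one rectangular region of one page."""
    cmd = [
        "pdftotext",
        "-f", str(page), "-l", str(page),      # first and last page: a single page
        "-x", str(x), "-y", str(y),            # top-left corner of the crop area
        "-W", str(width), "-H", str(height),   # width and height of the crop area
    ]
    if layout:
        cmd.append("-layout")                  # keep the physical layout, as in the question
    cmd += [pdf_path, "-"]                     # "-" sends the text to stdout
    return cmd


def extract_region(pdf_path, page, x, y, width, height):
    """Run pdftotext (must be on PATH) and return the text of the region."""
    result = subprocess.run(
        build_crop_command(pdf_path, page, x, y, width, height),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_two_columns(pdf_path, page, page_width=612, page_height=792):
    """Assumption: two equal-width columns split at the page's midpoint."""
    half = page_width // 2
    left = extract_region(pdf_path, page, 0, 0, half, page_height)
    right = extract_region(pdf_path, page, half, 0, page_width - half, page_height)
    return left + "\n" + right


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        # Usage: python columns.py mypdf.pdf 42
        print(extract_two_columns(sys.argv[1], int(sys.argv[2])))
```

Extracting each column separately avoids the problem of pdftotext trying to align lines across columns that do not match.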

Notes

  • I used version 0.90.1:

    pdftotext version 0.90.1
    Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
    Copyright 1996-2011 Glyph & Cog, LLC
    
  • Mac users: install with `brew install poppler`

  • Documentation

Mario Pérez Alarcón