I have a folder with lots of PDF files. I need to get the filename of each file whose content matches, together with the specific text matched: Rotate 270, which defines a page rotation. grep's -anH options and the /dev/null trick don't seem to work, and neither pdftotext nor pdfgrep can help, because what I need is not visible or searchable text on a page. I can get either "Binary file aaa.pdf matches" or a line like this (which is not text visible on a page!):

<</Filter/FlateDecode/Length 61>>stream4 595.19995]/MediaBox[0 0 841.92004 595.19995]/Parent 5 0 R/Resources<</ProcSet[/PDF/Text/ImageB/ImageC/ImageI]/XObject<</img3 11 0 R>>>>/Rotate 270/Type/Page>>

I suspect there is a way to strip the non-printable bytes before grep gets them, or to split off the filename before the grep step and put it back after grep has found the line. Or maybe sed has an easy way to achieve this?

How do I get both filename and found line, approximately like grep does on regular text files?

uldics
  • It is not clear to me exactly what you are asking but if you want to do grep on PDF files, have a look at [pdfgrep](https://pdfgrep.org/). (On a Debian-like system, you can install it with `apt-get install pdfgrep`.) – John1024 Jan 17 '18 at 08:00
  • I know pdfgrep, it is for other purposes. I don't need to search for visible (or even searchable) text on a page in PDF document. I need to search for definitions of objects to be shown on a document, page definitions, font definitions, image definitions etc. – uldics Jan 17 '18 at 10:00
  • I think you believe the text that you see when *viewing* a PDF with an appropriate viewer is stored "approximately like regular text files", to paraphrase your question. **That is not the case.** Plain text can be inside a compressed object, encoded with any of a set of standard encodings *plus* adjustments on those, and stored in disparate sections, interleaved with drawing commands. – Jongware Jan 17 '18 at 10:12
  • .. That said, this is confusing: ".. specific text in them - Rotate 270, which defines a page rotation .." and later on you state that you *can* find text "which is not a text visible on a page". The text "Rotate 270" is not visible on a page either. Or, if it is, then it's not a PDF instruction but just plain text. – Jongware Jan 17 '18 at 10:14
  • usr2564301, I see you are confused. Just open any PDF file with any text viewer (nano, atom, notepad) and you will understand what I mean. I do not need a text you can see with Adobe Reader. I need "the code" from the definitions of your mentioned compressed (or not) objects. Orders, commands for the Adobe Reader to show the PDF file appropriately. Not the stuff Adobe Reader is showing. – uldics Jan 17 '18 at 12:07
  • Take a look at this post: https://stackoverflow.com/a/29474423/501196 maybe you can combine one of these tools with grep somehow. Compressed object streams are going to be a challenge for doing what you want. Maybe one of them will allow you to encode all binary stream as Ascii85decode or similar – yms Jan 17 '18 at 13:07

1 Answer

I don't have a PDF file with that string inside, but you can try:

identify -verbose somefile.pdf | grep 'Rotate 270'

identify is part of the ImageMagick package.

You can also try a brute force method :-)

strings somefile.pdf | grep 'Rotate 270'
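
To get the filename together with the whole matching string across many files, the brute-force idea can be wrapped in a small shell function. This is only a sketch: the function name find_rotate is made up for illustration, and it assumes the page dictionary is stored uncompressed, as in the sample line in the question. Pages packed into compressed object streams will not be found this way.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example).
# Usage: find_rotate PATTERN FILE...
# Prints "FILE:printable-run" for every printable run that contains PATTERN.
find_rotate() {
    pattern=$1
    shift
    for f in "$@"; do
        # Turn every non-printable byte into a newline so grep sees plain
        # text lines; LC_ALL=C makes tr and grep work on raw bytes.
        LC_ALL=C tr -c '[:print:]' '\n' < "$f" |
            LC_ALL=C grep -F -- "$pattern" |
            while IFS= read -r line; do
                # Put the filename back in front of each matching line.
                printf '%s:%s\n' "$f" "$line"
            done
    done
}
```

For example, `find_rotate '/Rotate 270' *.pdf` should print one `name.pdf:...` line per match. Because non-printable bytes become line breaks, each match shows only the printable run around it rather than a whole binary "line".
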
LMC
  • It would not give the necessary result: the matching filename and the exact line with all the other parameters around Rotate 270; I need to see exactly how that Rotate 270 is applied. And ImageMagick is a little heavy in dependencies for getting such simple results on a headless machine. pdfinfo from poppler utils would be a bit lighter. Nevertheless, thanks, I will consider it when I need heavier image analysis on the same machine. – uldics Jan 18 '18 at 06:24
  • @uldics Updated answer with another option. – LMC Jan 18 '18 at 14:56