1

So I've spent the better part of a month on this issue. I'm looking for a way to extract geometry elements (polylines, text, arcs, etc.) from a vectorized PDF, organised by the file's OCGs (Optional Content Groups), which are basically PDF layers. Using PDFMiner I was able to extract geometry (LTCurves, LTTextBoxes, LTLines, etc.); using PyPDF2 I was able to see how many OCGs were in the PDF, but I was not able to access the geometry associated with each OCG. I've seen and tried a few hacky scripts online that might have solved this problem, but to no avail. I even resorted to opening the raw PDF data in a text editor and haphazardly removing parts of it to see if I could come up with some custom parsing technique, but again to no avail. Adobe's PDF manual is minimal at best, so that was no help when I was attempting to create a parser. Does anyone know a solution to this?

At this point, I'm open to a solution in any language, using any OS (though I would prefer a solution using Python 3 on Windows or Linux), as long as it is open source / free.

Can anyone here help end this rabbit hole of darkness? Much appreciated!

  • What exactly do you mean by "extract"? Also, perhaps you could elaborate on why you want to do this in the first place; of what importance is it? – Ryan Aug 23 '18 at 23:41
  • I basically mean I want to get the geometry objects from the file and know which OCG they are associated with. The goal for this script is to output multiple files each displaying a single layer. – pythonic_programmer Aug 24 '18 at 13:34
  • This is a non-trivial task. You'll need to iterate through all of the drawing instructions one by one looking for the beginning and end of the content marked to belong to an OCG. You'll want to create a new instruction list for each of the layers identified in the Catalog and then separately add those instructions to the Content dictionary for that page. That's the basic idea but there are going to be a lot of particulars depending on the PDF input and how it was assembled. – joelgeraci Aug 24 '18 at 16:08
  • What about content that is not part of a layer (not-optional). You want that to always be included in all the PDF files? Or is that not an issue for the files you are dealing with (all content is part of a layer)? – Ryan Aug 24 '18 at 17:31
  • joelgeraci, what exactly do you mean by "drawing instructions"? Is this part of the raw data, or part of the LT class structure of PDFMiner's objects? There were a lot of attributes in the class structure whose purpose I couldn't understand. And Ryan, I would want those objects too, but I'm assuming that if I can grab the objects associated with layers, then I can use my current PDFMiner method, which grabs all the geometry (without layer association), and find what's missing. Though, in my experience, all geometry is usually associated with a layer. – pythonic_programmer Aug 24 '18 at 18:24
  • "Though all the reliable solutions I found online were paid closed-source SDKs. You know any open source code that does this already? Thanks!" You did not indicate in your question that was a limitation/requirement. Perhaps you could update your question to reflect that you only want open-source and/or non-paid solutions. – Ryan Aug 27 '18 at 17:43

3 Answers

5

A PDF document consists of two "types" of data. There is an object-oriented "structure" that divides the document into pages and carries metadata (for instance, the list of Optional Content Groups), and there is a stream-oriented list of marking operators that actually "draw" content onto the page.

The fact that OCGs exist, along with their names and a few other properties, is stored in the object-oriented content and can be extracted fairly easily by parsing the objects. But the membership of the OCGs is NOT stored in the object structure. It can only be found by parsing the content stream. A group of marking operators is a member of a particular OCG when it is preceded by the content operator /OC /optionalcontentgroupname BDC and followed by the operator EMC.
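As a rough illustration of the BDC/EMC bracketing described above, here is a minimal Python sketch that walks an already-decompressed content stream (given as text) and groups the drawing instructions by layer. It is a simplification under several assumptions, not a real PDF parser: the function name `group_by_ocg` and the `properties` argument are made up for this example, the tokenizer ignores inline images and nested parentheses in strings, and nested layers are attributed to the innermost one only. In a real file the name after /OC is a key in the page's /Properties resource dictionary, which is what the hypothetical `properties` mapping stands in for.

```python
import re
from collections import defaultdict

# Simplified tokenizer: literal strings, dict delimiters, hex strings,
# array brackets, names, and everything else split on whitespace.
TOKEN_RE = re.compile(
    r"\((?:\\.|[^\\()])*\)|<<|>>|<[0-9A-Fa-f\s]*>|\[|\]"
    r"|/[^\s/\[\]()<>]*|[^\s/\[\]()<>]+"
)
# Operators are alphabetic keywords (plus ', " and a trailing *).
OPERATOR_RE = re.compile(r"^[A-Za-z'\"][A-Za-z0-9*]*$")
KEYWORD_OPERANDS = {"true", "false", "null"}


def group_by_ocg(content, properties=None):
    """Group instructions by layer; key None holds non-layer content.

    `properties` (hypothetical) maps resource names such as "/MC0"
    to human-readable layer names.
    """
    properties = properties or {}
    groups = defaultdict(list)
    stack = []    # one entry per open marked-content sequence
    pending = []  # operands collected since the last operator
    for tok in TOKEN_RE.findall(content):
        if tok in ("BDC", "BMC"):
            layer = None
            is_oc = (tok == "BDC" and len(pending) == 2
                     and pending[0] == "/OC" and pending[1].startswith("/"))
            if is_oc:
                layer = properties.get(pending[1], pending[1])
            stack.append(layer)
            pending = []
        elif tok == "EMC":
            if stack:
                stack.pop()
            pending = []
        elif OPERATOR_RE.match(tok) and tok not in KEYWORD_OPERANDS:
            # Attribute the instruction to the innermost open layer.
            active = [name for name in stack if name is not None]
            groups[active[-1] if active else None].append(
                " ".join(pending + [tok]))
            pending = []
        else:
            pending.append(tok)
    return dict(groups)
```

Calling `group_by_ocg(stream_text, {"/MC0": "Walls"})` returns a dictionary whose keys are layer names and whose values are the instructions found between the matching BDC and EMC. Writing each list back out as its own page content would still require carrying over graphics state (q/Q, cm, colours) that was set outside the layer, which this sketch does not attempt.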

Parsing a content stream is a less than trivial task. There are many tools out there that will do this for you. I would not, myself, attempt to build such a parser from scratch. There is little value in reinventing the wheel.

The complete syntax of PDF is available from many sources. Search the web for "PDF Specification 1.7" or "ISO 32000-1:2008". It is a daunting document to read, but it does supply all of the information needed to create both an object parser and a content parser.

  • Thanks for the insight Michael! I agree, parsing was a last ditch effort. Though all the reliable solutions I found online were paid closed-source SDKs. You know any open source code that does this already? Thanks! – pythonic_programmer Aug 25 '18 at 04:20
  • Sorry, I do not. I'm one of the paid closed source providers, myself. – Michael Mckeough Aug 27 '18 at 15:21
1

If your PDF is organized in OCG layers, then you can use the gdal_translate command from GDAL.

Use the following command to list all available OCG layers in your PDF file:

gdalinfo "sample.pdf" -mdd LAYERS

Then, use the following command to extract a particular layer:

gdal_translate "sample.pdf" -of PNG sample.png --config GDAL_PDF_LAYERS "your_specific_layer_name"

More details are mentioned here.
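Since the goal in the question is one output file per layer, the same command can be run in a loop. The sketch below is only an illustration in Python: it assumes the GDAL command-line tools are on the PATH and that the layer names have already been read from the gdalinfo output, and the helper names `layer_export_command` and `export_layers` are made up for this example. Note that `-of PNG` produces a rasterized image of each layer rather than vector geometry.

```python
import re
import subprocess
from pathlib import Path


def layer_export_command(pdf_path, layer_name, out_path):
    """Build the gdal_translate argument list for a single layer."""
    return [
        "gdal_translate", str(pdf_path),
        "-of", "PNG", str(out_path),
        "--config", "GDAL_PDF_LAYERS", layer_name,
    ]


def export_layers(pdf_path, layer_names, out_dir="."):
    """Run gdal_translate once per layer; return the output paths."""
    out_dir = Path(out_dir)
    outputs = []
    for name in layer_names:
        # Make the layer name safe to use in a file name.
        safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
        out_path = out_dir / f"{Path(pdf_path).stem}_{safe}.png"
        subprocess.run(
            layer_export_command(pdf_path, name, out_path), check=True)
        outputs.append(out_path)
    return outputs
```

For example, `export_layers("sample.pdf", ["your_specific_layer_name"])` would write `sample_your_specific_layer_name.png` next to the script, raising an error if gdal_translate fails.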

Sagar Rathod
-1

Hey @pythonic_programmer, I was able to use the Python library pdflayers to change the default view state (visible / not visible) of a layer when writing a new PDF file. https://pypi.org/project/pdflayers/

In practice this means disabling the default state of the layer in the PDF file: https://helpx.adobe.com/acrobat/using/pdf-layers.html

Any layer that is not visible will, by default, not be rendered when the PDF document is processed.