
I have an existing content stream in a PDF, and I want to extract the parts of the content stream that fall inside the following three bounding boxes.

The 1st BBox contains graphics and text. The 2nd BBox contains text and some math equations. The 3rd BBox contains an image plus text.

  1. I want to extract all of the content stream within each BBox.

  2. After extracting the content stream, I will do tagging-related manipulation on it, and then I want to place it back into a new PDF.

I want to do these operations using PDFBox. Is that possible?

Please help me with how I can achieve this.

  • In the attached PDF the content stream is not in sequence. I mean that while parsing the content stream, all the text-related instructions are found first, and the math-formula-related instructions are placed at the bottom. But we have to pull them all together and put them in one place in the new PDF, so the position-related operator values will have to change dynamically. How can I do this using PDFBox? – fascinating coder Dec 26 '19 at 14:27
  • PDFBox provides a content stream parsing framework (`PDFGraphicsStreamEngine`, `PDFStreamEngine`, ...) upon which you can base a content stream editing mechanism. You can do so in a manner similar to the `PdfContentStreamEditor` from [this answer](https://stackoverflow.com/a/58501254/1729265). But there is still quite a lot left for you to do: you have to make sure that after shuffling the instructions, all the actual drawing instructions are run with their original graphics state. – mkl Dec 26 '19 at 15:28
  • @mkl Thanks for the comment. Along with the `PdfContentStreamEditor` example classes, do you have any example code for graphics state manipulation? If so, please give me some links. – fascinating coder Dec 27 '19 at 03:17
  • I have used that `PdfContentStreamEditor` only for simple things so far, e.g. to remove specific text (large or transparent or in special fonts) or to change specific colors, but never for a task like yours. That's why I said there still is a lot to do. One option may be to extend the editor class to have multiple `ContentStreamWriter` instances, one for each bounding box on the current page, and output all non-drawing instructions to all of them and the drawing instructions only to the matching one. Those content streams would then be enveloped in `q ... Q` and concatenated to form the result stream. – mkl Dec 27 '19 at 09:25
  • I tried processing the page using `PDFGraphicsStreamEngine`. It completely messes up the positions of the glyphs. Check the link below: https://drive.google.com/file/d/1GD5gMNyzF4vlbJ_kF47xPFtP2ITcIfmW/view?usp=sharing So what is the generic approach? – fascinating coder Dec 27 '19 at 11:54
  • I'll look into that when I'm back in office next year. – mkl Dec 27 '19 at 15:39
  • You have indeed discovered a bug in the `PdfContentStreamEditor`: a few `OperatorProcessor` implementations recursively feed the other operators, in terms of which their own operator is defined, into the stream engine. For the purpose of editing one has to ignore those recursive processing calls, which `PdfContentStreamEditor` does not do yet. – mkl Jan 09 '20 at 17:12
  • I have fixed the issue in the `PdfContentStreamEditor` and described the fix in the [answer referenced above](https://stackoverflow.com/a/58501254/1729265). – mkl Jan 10 '20 at 13:51
  • Concerning the task itself, sorting contents by area, I had started implementing something (in the course of which I reproduced and fixed that bug) but soon stopped as it became obvious that there are a number of non-trivial problems to solve, in particular whenever some entity goes beyond the borders of one of the rectangles. I very much assume that a solution will require much more time than I have to spare currently. – mkl Jan 13 '20 at 16:46
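
The multi-writer idea from the comments above (non-drawing instructions go to every box's buffer, drawing instructions only to the matching one, each buffer wrapped in `q ... Q`) can be sketched without PDFBox at all. What follows is a simplified model in plain Java, not the `PdfContentStreamEditor` API: `Instruction`, `BoxRouter`, and `route` are hypothetical names, each instruction is assumed to already carry its device-space position (in a real editor the stream engine's graphics state would have to supply that), and a path together with its painting operator is treated as a single instruction.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of one content stream instruction. */
final class Instruction {
    final String text;     // raw operands plus operator, e.g. "0 0 1 rg"
    final Point2D anchor;  // position of what is drawn; null for non-drawing instructions

    Instruction(String text, Point2D anchor) {
        this.text = text;
        this.anchor = anchor;
    }

    boolean isDrawing() {
        return anchor != null;
    }
}

/** Hypothetical router: regroups a stream by bounding box. */
final class BoxRouter {

    static String route(List<Instruction> stream, List<Rectangle2D> boxes) {
        // One buffer per box, standing in for one writer per box.
        List<StringBuilder> buffers = new ArrayList<>();
        for (int i = 0; i < boxes.size(); i++) {
            buffers.add(new StringBuilder());
        }

        for (Instruction ins : stream) {
            if (!ins.isDrawing()) {
                // State changes go everywhere, so every box sees the
                // same graphics state history as the original stream.
                for (StringBuilder buffer : buffers) {
                    buffer.append(ins.text).append('\n');
                }
                continue;
            }
            // Drawing instructions go to the first box containing the anchor.
            // Anything outside all boxes is dropped in this sketch.
            for (int i = 0; i < boxes.size(); i++) {
                if (boxes.get(i).contains(ins.anchor)) {
                    buffers.get(i).append(ins.text).append('\n');
                    break;
                }
            }
        }

        // Envelope each box's instructions in q ... Q and concatenate.
        StringBuilder result = new StringBuilder();
        for (StringBuilder buffer : buffers) {
            result.append("q\n").append(buffer).append("Q\n");
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = List.of(
                new Rectangle2D.Double(50, 50, 100, 100),
                new Rectangle2D.Double(200, 50, 100, 100));
        List<Instruction> stream = List.of(
                new Instruction("0 0 1 rg", null),
                new Instruction("210 60 20 20 re f", new Point2D.Double(220, 70)),
                new Instruction("1 0 0 rg", null),
                new Instruction("60 60 20 20 re f", new Point2D.Double(70, 70)));
        System.out.print(route(stream, boxes));
    }
}
```

In this model the rectangle inside the first box is painted after both color changes and the one inside the second box after only the first, which is the same state each had in the original order. The sketch deliberately ignores the hard part mentioned in the last comment: an entity that extends beyond the border of a box is assigned by a single anchor point only.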
