
I'm trying to do something that I know isn't 100% reliable: removing certain bits of text from a PDF file. From what I've read, my understanding is that the only problem I'm facing is that I can't replace them.

What I'm trying to do is take the contents of a PDF file and copy that content over to another PDF file, but without the text matched by a regular expression. Finding the matches in my PDF file already works.

However, I can't figure out a way to remove them. Is there a way to say something like

// Remove all TextPosition objects that are within this list

Because I have gathered them, and I can't see why this shouldn't work.

Or is there a way to override what gets written to the new file, and then have that overridden method skip all TextPosition objects that I tell it to skip? I've seen examples of this, but none seem to work when I try them out. (In fact, a lot of the overridden methods don't even seem to be called at all.)

  • *"I can't see why this shouldn't work"* - one reason why that is at least hard, is that in the PDF there are no `TextPosition` objects. In the PDF you find instructions drawing strings in some arbitrary encoding. The PDFBox parsing mechanism splits these strings into individual characters, determines their positions etc, and builds a `TextPosition` from it. Unfortunately it does not add a reference back to the original string and character position therein. Thus, for code to be able to recognize the matching string parts in the PDF, it has to do all the parsing again and compare before copying – mkl Apr 23 '20 at 14:54
  • Thus, to implement your objective you had better not only work with the `TextPosition` objects but also somehow link them back to the string they come from to start with. – mkl Apr 23 '20 at 14:57
  • I see. I've actually looked a lot at your examples and they are really good. I might be completely off here, but I can't find a lot of documentation anywhere. The best info I've got is your examples and some others, but I'd like more examples with explanations somewhere. (Also, I know Apache removed the whole replace text bit because of how complicated it is.) I'm doing my bachelor's degree on this (and more), so that's why I'm looking. – harrigaturu Apr 24 '20 at 07:04
  • I'm actually looking at your example with HelloAnalyzer, and thing is your example isn't working with my PDF at all. But what I would really need is some good guidance on these operators. I can't add an operator like you did, it just won't add and thusly my ContentStreamWriter is null. – harrigaturu Apr 24 '20 at 08:58
  • The `HelloAnalyzer` and `HelloSignManipulator` from [this answer](https://stackoverflow.com/a/41125682/1729265) make use of a very special structure of the document contents in question, they are not really useful as a template for generic text editing. – mkl Apr 24 '20 at 11:04
  • Okay, yea I started to figure.. I'm getting better results out of just working with the COSStrings and COSArrays, maybe I can put them together somehow. I'm guessing since you seem like quite an expert at this that this might be undoable? What about just saying for instance "between x1, x2, y1 and y2 do not render contents" to an output stream? – harrigaturu Apr 24 '20 at 11:23
  • *"just working with the COSStrings and COSArrays, maybe I can put them together somehow"* - beware, if in your test documents the contents of those strings happen to be readably encoded (i.e. in some ASCII-like encoding), that's not always so but might make you believe things are easy. *"What about just saying for instance "between x1, x2, y1 and y2 do not render contents" to an output stream?"* - do you mean, applying a clip path that makes everything outside the clip area invisible? Then the text would still be in the file and retrievable by text extractors and copy&paste. Or... – mkl Apr 24 '20 at 12:51
  • ... Or do you mean removing instructions for drawing something in that area? That's essentially redaction. Redaction isn't that easy either, e.g. if there is an instruction drawing a single string "A B C" and you want to remove the B, you actually have to replace it by an equal width of free space, otherwise the C would be moved. In proportional fonts, that width usually is not equal to the width of a number of space characters, so you have to split the instruction to three instructions, draw "A ", move right by the width of B, draw " C". – mkl Apr 24 '20 at 12:57

1 Answer


*"I can't see why this shouldn't work"*

One reason why that is at least hard is that in the PDF there are no TextPosition objects.

In the PDF you find instructions drawing strings in some arbitrary encoding. The PDFBox parsing mechanism splits these strings into individual characters, determines their positions etc, and builds a TextPosition from it. Unfortunately it does not add a reference back to the original string and character position therein.

Thus, for code to be able to recognize the matching string parts in the PDF, it has to do all the parsing again and compare before copying.

Thus, to implement your objective you had better not only work with the TextPosition objects but also somehow link them back to the string they come from to start with.

This is somewhat beyond the scope of a Stack Overflow answer, but as this is the (or at least one) focus of your BA work, a decent attempt may fit that scope.

Thus, I'll give some pointers here to give you an idea how to get started.

Why is there no such mechanism in PDFBox to start with?

Actually there once was an example for editing text content of PDF documents in the PDFBox distribution (before version 2). It became more and more obvious, though, that this example relied on a number of preconditions, because documents not fulfilling those preconditions became more and more common, so this example was removed, cf. the PDFBox 2.0.0 migration guide.

You can find a more detailed description of the hindrances to easy text replacement in this answer, the quintessence of which is that generic text replacement is somewhere between complicated and impossible; if you can require certain preconditions in the original PDF, though, it becomes easier the more you can require.

In real life, though, you can only require such preconditions if you have a certain level of control over the input, e.g. if you only process outputs of certain other programs and know that those other programs fulfill those requirements.

Consequently PDFBox, being a general purpose library, removed the simple example.

An approach

For a more generic approach to text editing, you should indeed try a combination of text removal and text addition.

For text removal you should consider using something like the generic content stream editor class PdfContentStreamEditor discussed in this answer. As you want to use high-level PDFBox classes representing the text (like TextPosition), though, you probably want to base it on the PdfTextStripper (which uses these text position objects) instead of PDFGraphicsStreamEngine.

In that specialized text stripper / content editor, you'd collect all instructions being parsed instead of immediately writing them out again in write. Additionally you'd associate TextPosition objects retrieved by processTextPosition to the current text drawing instruction retrieved by write to later know which TextPosition belongs to which position of which text drawing instruction.
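The bookkeeping just described can be sketched without any PDFBox types. In the following minimal sketch all names (InstructionLog, Instruction, GlyphRef, addGlyph, addInstruction, locate) are hypothetical, and the type parameter G merely stands in for TextPosition. It assumes that the glyph callbacks for an instruction arrive before the instruction itself reaches write, which is how I understand the editor class to behave; if the order differs in your setup, the pending buffer has to be handled the other way around.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bookkeeping class; G stands in for PDFBox's TextPosition. */
final class InstructionLog<G> {

    /** One collected content stream instruction plus the glyphs it drew. */
    static final class Instruction<T> {
        final String operator;
        final List<Object> operands;
        final List<T> glyphs;

        Instruction(String operator, List<Object> operands, List<T> glyphs) {
            this.operator = operator;
            this.operands = operands;
            this.glyphs = glyphs;
        }
    }

    /** Where a glyph came from: which instruction, and which glyph of it. */
    static final class GlyphRef {
        final int instructionIndex;
        final int glyphIndex;

        GlyphRef(int instructionIndex, int glyphIndex) {
            this.instructionIndex = instructionIndex;
            this.glyphIndex = glyphIndex;
        }
    }

    private final List<Instruction<G>> instructions = new ArrayList<>();
    // Identity map: two glyphs with equal content are still distinct glyphs.
    private final Map<G, GlyphRef> origin = new IdentityHashMap<>();
    private List<G> pending = new ArrayList<>();

    /** Assumed to be called from the processTextPosition override. */
    void addGlyph(G glyph) {
        pending.add(glyph);
    }

    /** Assumed to be called instead of writing the instruction out at once. */
    void addInstruction(String operator, List<Object> operands) {
        int index = instructions.size();
        for (int i = 0; i < pending.size(); i++) {
            origin.put(pending.get(i), new GlyphRef(index, i));
        }
        instructions.add(new Instruction<>(operator, new ArrayList<>(operands), pending));
        pending = new ArrayList<>();
    }

    /** Returns the origin of a glyph, or null if it was never recorded. */
    GlyphRef locate(G glyph) {
        return origin.get(glyph);
    }

    List<Instruction<G>> instructions() {
        return instructions;
    }
}
```

The identity map is a deliberate choice here: the same letter usually occurs many times on a page, so lookups should depend on the object itself and not on any notion of equality between glyphs.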

When the whole page is parsed, you then can determine the TextPosition objects you want removed.
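As you already find your expressions, this step is mostly a matter of mapping regular expression matches back to glyphs. A minimal sketch, assuming you have the glyphs in reading order and one Unicode string per glyph (in PDFBox that would be what TextPosition.getUnicode() returns, which may be more than one character, e.g. for ligatures); the class GlyphMatcher and its method are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps regex matches back to the glyphs they cover. */
final class GlyphMatcher {

    /**
     * glyphTexts holds one Unicode string per glyph, in reading order.
     * Returns the indices of all glyphs touched by any match of the pattern.
     */
    static SortedSet<Integer> glyphsMatching(List<String> glyphTexts, Pattern pattern) {
        StringBuilder text = new StringBuilder();
        // glyphOfChar.get(c) is the index of the glyph that produced character c.
        List<Integer> glyphOfChar = new ArrayList<>();
        for (int g = 0; g < glyphTexts.size(); g++) {
            String s = glyphTexts.get(g);
            text.append(s);
            for (int k = 0; k < s.length(); k++) {
                glyphOfChar.add(g);
            }
        }

        SortedSet<Integer> hits = new TreeSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            for (int c = m.start(); c < m.end(); c++) {
                hits.add(glyphOfChar.get(c));
            }
        }
        return hits;
    }
}
```

Note that a match covering only part of a multi-character glyph marks the whole glyph, because a glyph can only be removed as a whole.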

Once they are known, find the associated text drawing instruction and position. Now you can split the text of each drawing instruction to change, drop the parts to remove, and replace them by some position advancement (e.g. using numerical entries in the array argument of a TJ instruction).
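To illustrate the splitting, here is a minimal sketch that turns one drawn string into the operands of a TJ array. It rests on several simplifying assumptions: one Java char per glyph (real strings are byte sequences in the font's encoding), glyph widths given in thousandths of a text space unit, character and word spacing both zero, and horizontal writing. Under these assumptions a number in a TJ array is subtracted from the current position, so the negated width of a removed glyph moves right by exactly that glyph's advance. TjSplitSketch and buildTjArray are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: rebuilds a text drawing instruction as TJ operands. */
final class TjSplitSketch {

    /**
     * text:   the drawn string, one char per glyph (simplification)
     * widths: glyph widths in thousandths of a text space unit
     * remove: true for every glyph that is to be dropped
     * Returns strings (String) and position adjustments (Float) in TJ order.
     */
    static List<Object> buildTjArray(String text, float[] widths, boolean[] remove) {
        List<Object> operands = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        float gap = 0f;
        for (int i = 0; i < text.length(); i++) {
            if (remove[i]) {
                // Close the string collected so far and start accumulating a gap.
                if (run.length() > 0) {
                    operands.add(run.toString());
                    run.setLength(0);
                }
                gap += widths[i];
            } else {
                // Negative adjustment: moves the following text to the right.
                if (gap != 0f) {
                    operands.add(-gap);
                    gap = 0f;
                }
                run.append(text.charAt(i));
            }
        }
        if (run.length() > 0) {
            operands.add(run.toString());
        }
        // A trailing gap keeps text drawn by later instructions in place.
        if (gap != 0f) {
            operands.add(-gap);
        }
        return operands;
    }
}
```

For the "A B C" example from the comments, with assumed widths of 667 for the letters and 250 for the spaces, removing the B yields the operands "A ", -667.0, " C". In an actual editor these would have to be converted back to the original string encoding and to PDFBox COS objects before being written.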

Once all text drawing instructions related to text positions to delete are so manipulated, you can finally write all the instructions to the editor output.

Thereafter you can add new text as usual at the positions in question.

At least this is how I would approach the task of a more generic text editor. There still are some challenges; e.g. the content stream editor just edits a single content stream while text of a page may be spread over the page content streams and referenced XObject content streams (and actually also pattern content streams).

Depending on the amount of work you are expected to invest in the PDF editing task you may or may not have to look into these challenges.

Documentation

In a comment you remark that you can't find a lot of documentation anywhere. The obvious documentation to use is the PDF specification, ISO 32000-1 and ISO 32000-2. If your department does in-depth PDF tasks a lot, they should have them available for you. If they don't, you can find a copy of ISO 32000-1 with the ISO headers removed published by Adobe on their web site, simply google for 'PDF32000'.

The specification obviously does not document how to replace text, but it documents what the content streams look like and which instructions there may be in them.

mkl
  • Sorry for the late response! This answer was clearly what I was looking for. I will look into this. However I will say that I managed to create something from my own ideas, + your examples and someone else's with some trial & error. It first analyzes (just like your example) then creates a new pdf with every hint of the desired pattern removed, then creates a third document with new text inserted on the empty spots. Some weird things like I have to load the document 2 times for it to properly gather info from the document, but I guess there's just more to read up on. – harrigaturu May 05 '20 at 10:25