I can't see why this shouldn't work
One reason why this is hard is that there are no TextPosition objects in the PDF.
In the PDF you find instructions drawing strings in some arbitrary encoding. The PDFBox parsing mechanism splits these strings into individual characters, determines their positions etc., and builds a TextPosition from each of them. Unfortunately it does not add a reference back to the original string and the character position therein.
Thus, for code to be able to recognize the matching string parts in the PDF, it has to do all the parsing again and compare before copying.
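To illustrate what "some arbitrary encoding" means in practice, here is a minimal sketch in plain Java (no PDFBox involved). It assumes single-byte character codes and a made-up code-to-Unicode table; the class EncodingSketch and its decode method are hypothetical names, not part of any library.

```java
import java.util.Map;

final class EncodingSketch {
    /**
     * Decodes the raw bytes of a string operand using a font-specific table.
     * Assumption: every character code is exactly one byte long.
     */
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (byte code : codes) {
            // Codes missing from the table are rendered as "?" in this sketch.
            text.append(toUnicode.getOrDefault(code & 0xFF, "?"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical subset font: code 1 means "H", code 2 means "i".
        Map<Integer, String> font = Map.of(1, "H", 2, "i");
        System.out.println(decode(new byte[] { 1, 2 }, font));
    }
}
```

The bytes in the content stream (here 1 and 2) need not resemble the text they draw, so searching the raw stream for the text is not an option; you have to go through the font's decoding, which is what the parsing step does.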
So, to achieve your objective, you should not only work with the TextPosition objects but also somehow link them back to the strings they come from.
This is somewhat beyond the scope of a Stack Overflow answer, but as this is the (or at least one) focus of your BA work, a decent attempt may fit that scope.
So I'll give some pointers here to give you an idea of how to get started.
Why is there no such mechanism in PDFBox to start with?
Actually there once was an example for editing text content of PDF documents in the PDFBox distribution (before version 2). It became more and more obvious, though, that this example relied on a number of preconditions, because documents not fulfilling those preconditions became more and more common, so this example was removed, cf. the PDFBox 2.0.0 migration guide.
You can find a more detailed description of the hindrances to easy text replacement in this answer, the quintessence of which is that generic text replacement is somewhere between complicated and impossible; the more preconditions you can require of the original PDF, though, the easier it becomes.
In real life, though, you can only require such preconditions if you have a certain level of control over the input, e.g. if you only process the output of certain other programs and know that those programs fulfill the requirements.
Consequently PDFBox, being a general-purpose library, removed the simple example.
An approach
For a more generic approach to text editing, you should indeed try a combination of text removal and text addition.
For text removal you should consider using something like the generic content stream editor class PdfContentStreamEditor discussed in this answer. As you want to use high-level PDFBox classes representing the text (like TextPosition), though, you probably want to base it on the PDFTextStripper (which uses these text position objects) instead of PDFGraphicsStreamEngine.
In that specialized text stripper / content editor, you'd collect all instructions being parsed instead of immediately writing them out again in write. Additionally you'd associate the TextPosition objects received in processTextPosition with the current text drawing instruction received in write, so that you later know which TextPosition belongs to which position of which text drawing instruction.
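The bookkeeping described here could be sketched as follows. This is plain Java without PDFBox: GlyphSketch stands in for TextPosition, and onTextPosition / onWrite stand in for processTextPosition / write. All names are hypothetical, and the sketch assumes that the text positions of an instruction are reported before that instruction reaches the write step.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a TextPosition; only the decoded text is kept. */
final class GlyphSketch {
    final String unicode;
    GlyphSketch(String unicode) { this.unicode = unicode; }
}

/** One collected content stream instruction plus the glyphs it produced. */
final class CollectedInstruction {
    final String operator; // e.g. "BT", "Tf", "Tj", "TJ"
    final List<GlyphSketch> glyphs = new ArrayList<>();
    CollectedInstruction(String operator) { this.operator = operator; }
}

/** Collects instructions instead of writing them out immediately. */
final class InstructionCollector {
    final List<CollectedInstruction> instructions = new ArrayList<>();
    private final List<GlyphSketch> pending = new ArrayList<>();

    /** Role of processTextPosition: buffer glyphs until their instruction arrives. */
    void onTextPosition(GlyphSketch glyph) {
        pending.add(glyph);
    }

    /** Role of write: store the instruction and attach the buffered glyphs to it. */
    void onWrite(String operator) {
        CollectedInstruction instruction = new CollectedInstruction(operator);
        instruction.glyphs.addAll(pending);
        pending.clear();
        instructions.add(instruction);
    }

    /** Returns {instruction index, glyph index} for a glyph, or null if it is unknown. */
    int[] locate(GlyphSketch glyph) {
        for (int i = 0; i < instructions.size(); i++) {
            // indexOf compares by identity here because GlyphSketch does not override equals.
            int j = instructions.get(i).glyphs.indexOf(glyph);
            if (j >= 0) {
                return new int[] { i, j };
            }
        }
        return null;
    }
}
```

A real implementation would keep the actual operator and operand objects rather than just the operator name, and it would have to record the byte range of each character code in the string operand, because character codes can be longer than one byte.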
When the whole page is parsed, you can then determine the TextPosition objects you want removed.
Once they are known, find the associated text drawing instruction and position. Now you can split the text of each drawing instruction to change, drop the parts to remove, and replace them by some position advancement (e.g. using numerical entries in the array argument of a TJ instruction).
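The splitting step could look roughly like this. It is a sketch in plain Java under simplifying assumptions: one char per glyph, glyph widths given in thousandths of a text space unit, and no character or word spacing in effect. TjSplitSketch and split are hypothetical names; the resulting list merely models the array argument of a TJ instruction.

```java
import java.util.ArrayList;
import java.util.List;

final class TjSplitSketch {
    /**
     * Builds TJ array entries for one string: kept characters become String
     * entries, each run of removed characters becomes one negative Integer entry.
     *
     * @param text   decoded characters of the string, one char per glyph (assumption)
     * @param widths glyph widths in thousandths of a text space unit
     * @param remove remove[i] is true if glyph i is to be dropped
     */
    static List<Object> split(String text, int[] widths, boolean[] remove) {
        List<Object> array = new ArrayList<>();
        StringBuilder kept = new StringBuilder();
        int gap = 0;
        for (int i = 0; i < text.length(); i++) {
            if (remove[i]) {
                // Flush the kept run before the gap starts.
                if (kept.length() > 0) {
                    array.add(kept.toString());
                    kept.setLength(0);
                }
                gap += widths[i];
            } else {
                // A TJ number is subtracted from the position, so advancing needs a negative value.
                if (gap > 0) {
                    array.add(-gap);
                    gap = 0;
                }
                kept.append(text.charAt(i));
            }
        }
        if (kept.length() > 0) {
            array.add(kept.toString());
        }
        // A trailing gap keeps text drawn by later instructions at its original position.
        if (gap > 0) {
            array.add(-gap);
        }
        return array;
    }

    public static void main(String[] args) {
        int[] widths = { 722, 556, 222, 222, 556 }; // made-up widths for "Hello"
        boolean[] remove = { false, true, true, false, false };
        System.out.println(split("Hello", widths, remove)); // prints [H, -778, lo]
    }
}
```

In content stream syntax the example result would correspond to something like [(H) -778 (lo)] TJ. With character or word spacing set, the gap value would have to account for those too.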
Once all text drawing instructions related to the text positions to delete have been manipulated this way, you can finally write all the instructions to the editor output.
Thereafter you can add new text as usual at the positions in question.
At least this is how I would approach the task of a more generic text editor. There still are some challenges; e.g. the content stream editor just edits a single content stream while text of a page may be spread over the page content streams and referenced XObject content streams (and actually also pattern content streams).
Depending on the amount of work you are expected to invest in the PDF editing task you may or may not have to look into these challenges.
Documentation
In a comment you remark that you can't find a lot of documentation anywhere. The obvious documentation to use is the PDF specification, ISO 32000-1 and ISO 32000-2. If your department does in-depth PDF tasks a lot, they should have them available for you. If they don't, you can find a copy of ISO 32000-1 with the ISO headers removed, published by Adobe on their web site; simply google for 'PDF32000'.
The specification obviously does not document how to replace text, but it documents what the content streams look like and which instructions may occur in them.