I was wondering if there is a programming library available that allows for the inline editing of text within a PDF document. Drawing text onto the document isn't what I'm after this time, and I am already aware of a number of facilities and libraries that allow this to be done; I am looking for something that will allow me to make a change like this (where NEW isn't drawn in but edited into, for instance, a string):

"This is my document" becomes "This is my NEW document".

... The formatting should be preserved (especially where editing isn't being done within a specific area on the page). Word wrapping support would be great too!

So is there anything like this out there, or am I barking up the wrong tree? I've looked at a range of facilities such as FPDF, PdfBox, and even GNOME without much luck (tbh, I am sure GNOME may allow it, but getting my head around it is too time-consuming at the moment, so pointers on this would also be great).

Thanks, and sorry if this has already been asked.

In terms of programming languages: I am willing to utilise what is suggested in C, C++, Java, PHP, Python, and Perl.

tiredone
  • Maybe [this](http://stackoverflow.com/a/9393318/1255746) is helpful. – Josh M Aug 27 '13 at 15:55
  • Hmm... I am going to update the question later to state that the formatting should be kept within the edited line. But first, does your suggestion keep the formatting? – tiredone Aug 27 '13 at 16:18
  • Perhaps I am approaching the problem from the wrong angle and should be manipulating some other standard type of document text (such as Microsoft Word's or LibreOffice's XML format) and then exporting that to PDF. But what would the best library be for that (i.e. say XML/HTML5 to PDF)? – tiredone Aug 27 '13 at 21:53
  • If that is an option, you should switch formats. PDF is an end format, and any attempt to substantially change existing content (in contrast to adding new content) is at least very difficult, especially if it includes reflowing. Which format is best depends on circumstances, e.g. who creates the templates. – mkl Aug 27 '13 at 22:26

2 Answers

0

To follow up on my comments, this is what fairly typical raw PDF text output looks like -- a decompressed (inflated) part of page 1213 of the PDF Reference Guide 16-v4:

36451 0 obj  % Contents
% used filter: FlateDecode
/GS2 gs
BT
/F1 1 Tf
8 0 0 8 297.417 105.667 Tm
0 0 0 1 k
0 Tc
0 Tw
(1213) Tj
/F5 1 Tf
24 0 0 24 253.784 617 Tm
[ (C) 19.1 (olophon) ] TJ
/F3 1 Tf
10.505 0 0 10.505 136.5 566 Tm
-0.0014 Tc
0.2018 Tw
[ (This do) -10.1 (c) -7.2 (u) -0.3 (men) 17.6 (t) -1.4 ( was p) 10 (r) 11.9 (o) -10.1 (d) 10.8 (uce) -7.2 (d) -1.3 ( usin) 6.6 (g ) 36.5 (A) 24.6 (d) 0.9 (o) 3.8 (b) -10.1 (e) ] TJ
8.4 0 0 8.4 326.25 570.2 Tm
0 Tc

... Several hundred more lines like these are omitted. Some points of interest: Tf sets the text font (which is defined elsewhere, and which may have a custom encoding -- not always ASCII). Tj 'shows' text; Tm sets a transformation matrix in 'current units'. It's impossible to tell immediately whether the text 'Colophon' follows right after the '1213' without knowing the actual size of both. Tc and Tw set the default character and word spacing, and are often abused to insert 'spaces'. Not here, though; the TJ array specifies text fragments with interspersed kerning values (I guess, based on their location).
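To make the TJ point concrete, here is a minimal Python sketch that splits a TJ operand into its text fragments and numeric adjustments. The names `parse_tj_array` and `visible_text` are made up for this illustration, and the parser assumes plain literal strings with no escaped or nested parentheses and a font encoding that happens to be ASCII-compatible, which, as said above, is not always the case.

```python
import re

# Either a parenthesised string literal or a number inside a TJ array.
# Simplification: escaped or nested parentheses are not handled.
TJ_ITEM = re.compile(r"\(([^()\\]*)\)|(-?(?:\d+\.?\d*|\.\d+))")


def parse_tj_array(operand):
    """Split the operand of a TJ operator into (kind, value) pairs."""
    items = []
    for text, number in TJ_ITEM.findall(operand):
        if number:
            # Numeric entries shift the next glyph; they are not characters.
            items.append(("adjust", float(number)))
        else:
            items.append(("text", text))
    return items


def visible_text(operand):
    """Join only the string fragments, ignoring the adjustments."""
    return "".join(value for kind, value in parse_tj_array(operand) if kind == "text")


heading = "[ (C) 19.1 (olophon) ] TJ"
body = ("[ (This do) -10.1 (c) -7.2 (u) -0.3 (men) 17.6 (t) -1.4 ( was p) 10 (r) "
        "11.9 (o) -10.1 (d) 10.8 (uce) -7.2 (d) -1.3 ( usin) 6.6 (g ) 36.5 (A) "
        "24.6 (d) 0.9 (o) 3.8 (b) -10.1 (e) ] TJ")

print(visible_text(heading))   # Colophon
print(visible_text(body))      # This document was produced using Adobe
print("document" in body)      # False: the word never appears contiguously
```

The last line is the important one: a plain search for "document" in the raw stream finds nothing, because the word is spread over five separate string fragments.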

It's not possible to determine if this single text line is a line on its own, or part of a longer paragraph. It's not even possible to determine if it's a justified string or not -- you would need to compare its left and right edges to other lines to find out.

(This output is created with a PDF reader I wrote myself from scratch, using aforementioned reference and not much more.)

As you can see, merely finding text is a challenge, although there are libraries which are more or less successful at that. None of them -- if I'm correct -- claims to be able to edit "any PDF".

Jongware
  • Your answer has come closest to the truth of the matter. Along with the various comments already made, my suggestion to others looking for something similar is to use a truly editable format and then export that to PDF. I am going to investigate my options further from here, and if I find a quick solution I'll leave another comment. – tiredone Aug 28 '13 at 07:25
  • I am going to proceed using LibreOffice's unoconv command-line program (for now) in conjunction with LibreOffice's .fodt format. Something like this: `unoconv -f pdf -o out.pdf MyDocument.fodt` – tiredone Aug 28 '13 at 08:08
  • The only disadvantage is that, unless I use the listener, I'll have to execute this command each time within a shell context and probably with a known temporary file. The unoconv3.py looks interesting though. – tiredone Aug 28 '13 at 08:19
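For anyone wanting to script the conversion from the comments above, a rough Python sketch could look like the following. The function names `build_unoconv_command` and `convert_to_pdf` are invented for this example; the only thing taken from the thread is the command line itself, and the sketch assumes `unoconv` is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_unoconv_command(source, target):
    """Mirror the command from the comment: unoconv -f pdf -o out.pdf MyDocument.fodt"""
    return ["unoconv", "-f", "pdf", "-o", str(target), str(source)]


def convert_to_pdf(source, target):
    """Run unoconv once per document (no listener); raises if the tool is missing or fails."""
    if shutil.which("unoconv") is None:
        raise FileNotFoundError("unoconv was not found on PATH")
    # Passing a list avoids going through a shell and its quoting rules.
    subprocess.run(build_unoconv_command(source, target), check=True)
    return Path(target)


print(build_unoconv_command("MyDocument.fodt", "out.pdf"))
```

Calling `convert_to_pdf("MyDocument.fodt", "out.pdf")` would then start one unoconv process per document, which is the per-call overhead mentioned in the last comment.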
-1

Just look up the text in the PDF file and modify it. If I am not mistaken, string literals are encoded like this: (This is my document). If the text you want to change is split into more than one string literal, or you need word wrap, then any PDF library probably isn't going to help you much.

  • It is like `(This is my document)` only in the simplest documents, i.e. in documents with standard encodings only and without kerning. – mkl Aug 27 '13 at 17:34
  • Text in just about *any* PDF document is broken up over multiple strings. Consider font, size, or color changes, word- and letter spacing, and left, right, centered or justified text. And no two PDF producers follow the same routine to do the same thing. "Automatic word wrapping"... forget it. PDF is not meant to be editable this way. – Jongware Aug 27 '13 at 17:36
  • The document I have doesn't actually contain plain text within; it's all encoded. The version of PDF used by the document is 1.5 (PDF-1.5). Which version of PDF should I use to encode the pdf so that the plain text is visible? The word wrapping limitations could be a blocker, it seems. – tiredone Aug 27 '13 at 21:39
  • "Encoded", why do you think so? It's probably just compressed -- a valid operation for about any version of PDF. It may be 'encoded' with any number of the valid PDF Encoding streams (ASCIIHex, ASCII85, LZW, Flate, RunLength, CCITTFax), in any order (3.3 of PDF Reference 1.4, but also valid for your 1.5). In addition, separate object streams may have been concatenated into a single composite object. It's version independent: if you create a PDF, you can choose to compress plain data streams or not -- the latter will yield a larger but readable PDF. – Jongware Aug 27 '13 at 22:30
  • This suggestion is not an answer to the question, nor is there really any good answer other than starting with the original documents. This suggestion will likely not work even in the simplest of cases. PDF is not meant to be an editable format in the way specified. In any case where manipulation is required, the only valid solution is starting from the original content. – Kevin Brown Aug 28 '13 at 03:18
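To illustrate the compression point from the comments, here is a small Python sketch using the standard `zlib` module, which implements the same deflate algorithm that the Flate filter uses. The content stream is a hand-made, hypothetical one in the "simplest case" form, and `inflate` and `string_literals` are names made up for this example; locating the stream inside a real PDF file (object parsing, filter chains, object streams) is deliberately left out.

```python
import re
import zlib

# Hypothetical, hand-made content stream; real ones are rarely this simple.
plain = b"BT\n/F1 12 Tf\n72 720 Td\n(This is my document) Tj\nET\n"

# Roughly what a producer stores when the stream uses /Filter /FlateDecode.
compressed = zlib.compress(plain)


def inflate(data):
    """Undo Flate compression on the raw bytes of a stream."""
    return zlib.decompress(data)


def string_literals(content):
    """Find simple (...) literals; escaped or nested parentheses are ignored."""
    return re.findall(rb"\(([^()\\]*)\)", content)


content = inflate(compressed)
print(string_literals(content))   # [b'This is my document']

# The naive edit from the answer; it changes the bytes, nothing is reflowed.
edited = content.replace(b"(This is my document)", b"(This is my NEW document)")
print(string_literals(edited))    # [b'This is my NEW document']
```

Even in this best case the edit only swaps bytes: any text after the string on the same line is not moved, and a real file would also need its stream length (and usually its cross-reference table) updated after recompressing.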