I have a large PDF (~20 MB, 160 MB uncompressed). I need to run about 1000 find-and-replace operations on its text. Here is what I tried.

  1. Via SVG

    • Transform to SVG (inkscape)
    • Read SVG line by line and do the replace in the file
    • Transform back to PDF

=> Bad output: the text is rendered incorrectly, probably because of a geometric transform matrix in the SVG.

  2. Creating ~1000 sed commands

    • Uncompress PDF
    • Perform each replace with a sed command
    • Recompress PDF

=> Far too slow: each sed command takes about 20 seconds, so the whole run takes several hours.

  3. Read line by line and replace

    • Uncompress PDF
    • Read the PDF line by line
      • find text to be replaced
      • replace using perl
      • write line to a new file
    • Compress the new file

=> The new file is apparently damaged, probably because binary data streams remain in the uncompressed PDF and get written out as lines of text.

I wonder if it would be possible to read the uncompressed PDF line by line, but edit it directly in place. How could I do this?

I have searched for Perl in-place editing, but it applies the changes to the whole file at once, while I'd like to edit a single line.

Other ideas are more than welcome ;)

Following the advice given, I used CAM::PDF; this was the simplest and most efficient solution.

Denis Rouzaud

2 Answers

There is no difference between 2. and 3. sed reads the input file line by line and writes the (possibly changed) lines to the output file. The `-i` switch does not change that: sed still writes a complete new file and then puts it in place of the original (GNU sed writes a temporary file and renames it over the input). That's it. No magic involved. So if your Perl script damaged the content but sed did not, the script is doing something that sed does not. The main difference is that a Perl script can be made much faster when replacing many strings. See "Using sed on text files with a csv".

The main trick is to compile all the search strings into one regexp, so that the search and replace runs in a single pass over the input instead of one pass per string.

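use strict;
use warnings;

my %replace = ( foo => 'bar' );    # search string => replacement
# Longest keys first, so a key is never shadowed by a shorter prefix
my $re = join '|', map quotemeta, sort { length($b) <=> length($a) } keys %replace;
$re = qr/($re)/;

binmode STDOUT;                    # leave binary stream data untouched on output
while (<>) {
    s/$re/$replace{$1}/g;
    print;                         # write the (possibly modified) line
}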

You can use it with your original approach, but I would recommend doing it in a Perl script, which lets you reuse the regexp and the replacement hash across PDF files. You can also try combining it with CAM::PDF, which ships with an example script, changepagestring.pl. You can also look at PDF::API2, which would require more work but may give a better result. But remember, the PDF format is not intended for modification.
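
As a rough illustration of combining the single regexp with CAM::PDF, here is a minimal sketch. The names `%replace`, `build_regexp`, `replace_in_content` and `translate_pdf` are made up for this example; the CAM::PDF calls (`new`, `numPages`, `getPageContent`, `setPageContent`, `cleanoutput`) are, to my knowledge, the ones changepagestring.pl itself relies on. It assumes the search strings appear literally in the page content streams, which is not guaranteed (text can be split by kerning or use a custom font encoding).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical replacement table: search string => replacement.
my %replace = ( 'foo' => 'bar', 'foobar' => 'baz' );

# One alternation for all keys, longest first so 'foobar' wins over 'foo'.
sub build_regexp {
    my ($table) = @_;
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %$table;
    return qr/($alt)/;
}

# Apply every replacement to one page content stream in a single pass.
# Returns the new content and the number of substitutions made.
sub replace_in_content {
    my ($content, $re, $table) = @_;
    my $count = ($content =~ s/$re/$table->{$1}/g);
    return ($content, $count || 0);
}

# Rewrite each page of $infile and save the result as $outfile.
sub translate_pdf {
    my ($infile, $outfile, $table) = @_;
    require CAM::PDF;    # loaded here so the helpers above work without it
    my $re  = build_regexp($table);
    my $pdf = CAM::PDF->new($infile) or die "CAM::PDF could not read $infile\n";
    for my $p (1 .. $pdf->numPages()) {
        my ($content, $n) =
            replace_in_content($pdf->getPageContent($p), $re, $table);
        $pdf->setPageContent($p, $content) if $n;   # only touch changed pages
    }
    $pdf->cleanoutput($outfile);
}

# Usage (script name is hypothetical): perl translate.pl in.pdf out.pdf
translate_pdf(@ARGV, \%replace) if @ARGV == 2;
```

Because CAM::PDF parses the file and writes it back out, the binary streams are never treated as lines of text, which sidesteps the corruption seen in approach 3.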

Hynek -Pichi- Vychodil
  • Well, not exactly the same. In 2, I have to run the sed command 1000 times. With 3, I read the file only once, and whenever I find a pattern to replace, I replace it. This is why 2 takes 6-7 hours while 3 needs < 1 min. – Denis Rouzaud Mar 25 '15 at 11:53
  • In 3, the thing is that I read the file as ASCII and write the content to a new file as ASCII, so I believe the problem comes from some binary stream left over after uncompressing the file. – Denis Rouzaud Mar 25 '15 at 11:54
  • *“But remember, PDF format is not intended for modification”* What do you mean? Adobe Acrobat does exactly that. – Borodin Mar 25 '15 at 11:56
  • @DenisRouzaud You can run sed only once as well. You can pass several `-e` parameters, or write multiple commands in one expression separated by `;`. – Hynek -Pichi- Vychodil Mar 25 '15 at 12:54
  • @DenisRouzaud What do you mean by an ASCII file? What do you think it means? ASCII is a way to encode characters in binary. There are EBCDIC computers out there, mostly in museums. Are you working on Windows? Maybe you should use the `-C0` switch, `binmode()`, or the `:raw` I/O layer. Perl doesn't treat ASCII differently by default. – Hynek -Pichi- Vychodil Mar 25 '15 at 13:07
  • @Borodin: A PDF contains information about how content should _exactly_ appear on the output device. That means that if you change text in a PDF, it will not _magically_ rearrange itself. Compare it with (La)TeX, HTML, ODT, RTF, ... Yep, Adobe Acrobat does exactly that. Sorry, but the ability to modify a PDF is very limited. You can modify the machine code of a compiled program as well. – Hynek -Pichi- Vychodil Mar 25 '15 at 13:11
  • @Hynek-Pichi-Vychodil: Sure, PDF doesn't re-flow the text automatically, but that doesn't mean you can't edit it. The same applies to an image file, but no one would suggest that the JPEG format wasn't meant for editing. – Borodin Mar 25 '15 at 13:56
  • @Borodin: No, JPEG is not meant for editing, because it has already lost image information and each step adds more JPEG compression artefacts. But the point is that being _able_ to do something doesn't mean it was _intended_. You _can_ edit it, but it was not _intended_ that you __should__ do it. You can edit any binary file. – Hynek -Pichi- Vychodil Mar 25 '15 at 14:03
  • @Borodin: From Camelot project (PDF origin): _Our vision for Camelot is to provide a collection of utilities, applications, and system software so that a corporation can effectively capture documents from any application, send electronic versions of these documents anywhere, and view and print these documents on any machines._ See, there is not any modification mentioned. – Hynek -Pichi- Vychodil Mar 25 '15 at 14:41
  • @Hynek-Pichi-Vychodil: I'm not convinced. Sorry. – Borodin Mar 25 '15 at 14:50
  • @Borodin: It's not about convincing you. This is just a simple fact. PDF was not designed and created with modification of an existing file in mind. You can disagree with it, you can protest against it, but that is all you can do about it. It's a fact. – Hynek -Pichi- Vychodil Mar 25 '15 at 14:55
  • I am on Linux. Here is the code: https://github.com/qgep/QGEP/blob/master/datamodel/diagram/translate_diagram.pl I added the binmode but have not tested it yet. – Denis Rouzaud Mar 25 '15 at 14:56
  • @Hynek-Pichi-Vychodil: Then put it like this: I believe you are wrong. – Borodin Mar 25 '15 at 15:00
  • @Hynek-Pichi-Vychodil: thanks for the CAM::PDF plug. I'm the author of that module. – Chris Dolan Jul 31 '15 at 19:32

You can follow the pdftk steps described in "How to find and replace text in a existing PDF file with PDFTK (or other command line application)".
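
The pdftk route (uncompress, replace, recompress) might be sketched in Perl roughly as follows. This is only an outline under assumptions: the names `%replace`, `build_regexp`, `rewrite_raw` and `translate_with_pdftk` and the scratch file names are invented for the example, and it assumes the search strings appear literally in the uncompressed file. Replacing a string with one of a different length shifts byte offsets and stream lengths; pdftk may be able to repair these when it rewrites the file, but check the result on your own document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical replacement table: search string => replacement.
my %replace = ( 'foo' => 'bar' );

# One alternation for all keys, longest first to avoid prefix shadowing.
sub build_regexp {
    my ($table) = @_;
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %$table;
    return qr/($alt)/;
}

# Copy $in to $out in raw (binary) mode, applying all replacements to
# each line in one pass. No encoding or newline translation is applied.
sub rewrite_raw {
    my ($in, $out, $re, $table) = @_;
    open my $src, '<:raw', $in  or die "cannot read $in: $!";
    open my $dst, '>:raw', $out or die "cannot write $out: $!";
    while (my $line = <$src>) {
        $line =~ s/$re/$table->{$1}/g;
        print {$dst} $line;
    }
    close $src;
    close $dst or die "cannot close $out: $!";
}

# uncompress -> replace -> compress, with pdftk doing the outer two steps.
sub translate_with_pdftk {
    my ($in, $out, $table) = @_;
    my ($plain, $edited) = ("$in.plain.pdf", "$in.edited.pdf");  # scratch files
    system('pdftk', $in, 'output', $plain, 'uncompress') == 0
        or die "pdftk uncompress failed\n";
    rewrite_raw($plain, $edited, build_regexp($table), $table);
    system('pdftk', $edited, 'output', $out, 'compress') == 0
        or die "pdftk compress failed\n";
}

# Usage (script name is hypothetical): perl translate.pl in.pdf out.pdf
translate_with_pdftk(@ARGV, \%replace) if @ARGV == 2;
```

Opening both files with the `:raw` layer is what keeps the binary streams intact while still allowing line-by-line processing.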

You can first split the PDF into smaller documents with a few pages each, replace the text, and then merge them back together, all using pdftk.

There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface. You can process individual pages and then do a find replace using scripting commands. See if it loads your PDF.

gn1
  • Why would you want to split the document into smaller documents? – reinierpost Mar 25 '15 at 11:03
  • Denis: The PDF has a single page and the doc size is 160 MB. This probably means that the PDF contains one big image and the text is in the image rather than stored as text. Please check whether you are able to select the text in a PDF viewer such as Evince. – gn1 Mar 26 '15 at 03:22
  • Reiner: I thought memory limits were being hit because the doc had lots of pages. Documents with fewer pages would have been more manageable. – gn1 Mar 26 '15 at 03:24