
On Ubuntu 18.04 I have a problem editing PDF files, specifically searching for and replacing strings.

I tried:

  • PHP mPDF Overwrite(): does nothing.

  • Perl CAM::PDF 1.60 changepagestring.pl: does nothing.

  • sed: does nothing.

It does not work with compressed or decompressed PDFs; it does not even work with PDFs generated by mPDF. UPDATE: after reinstalling libsodium, mPDF works fine with PDF files generated by mPDF. For other PDF files the issue still exists.

I also tried under /var/www (user/group www-data:www-data) and in other folders, e.g. /home.

Any ideas for bulk search & replace? I have over 1000 files to process.

The text in the files is readable; I checked.

P.S. Search/replace from a desktop program and from an online service works with the same files.

Permissions on the files are 0755 and 0777:

root@sasa-ubuntu-1:/var/www/website.local/wp-content/test/2018/12# ls -la *.pdf
-rwxr-xr-x 1 www-data www-data 847451 Oct 18 12:21 clean.pdf
-rwxrwxrwx 1 www-data www-data 395527 Oct 17 21:41 My-First.pdf
-rwxr-xr-x 1 www-data www-data 838307 Oct 17 23:30 My.pdf
-rwxr-xr-x 1 www-data www-data 838167 Oct 18 12:24 New2.pdf
-rwxr-xr-x 1 www-data www-data 838167 Oct 18 01:20 New.pdf
-rwxrwxrwx 1 www-data www-data 270340 Oct 17 16:39 Test2.pdf
-rwxrwxrwx 1 www-data www-data 274022 Oct 17 16:39 Test1.pdf
-rwxr-xr-x 1 www-data www-data 838000 Oct 18 00:55 Test2.pdf
-rwxrwxrwx 1 www-data www-data 205679 Oct 17 23:44 test.pdf

The Perl script always returns "Could not find title", even though the file content is readable when I print the $page variable (see images):

use CAM::PDF;

my $pdf = CAM::PDF->new('test.pdf'); # existing document
my $nump = $pdf->numPages();
#print $nump;

my $page = $pdf->getPageContent(1);

print $page;
# $page now holds the uncompressed page content as a string

# replace the text part
if ($page =~ s/Wagner/SoundTech/g) {
    $pdf->setPageContent(1, $page);
}
else {
    die "Could not find title\n";
}

$pdf->cleanoutput('Test2.pdf');

enter image description here

A lot of files end this way.

The string I am trying to find is "Wagner International Music Examinations", or just "Wagner".

mPDF and CAM-PDF are properly installed, without warnings or errors and with all dependencies, I hope. Versions: Ubuntu 18.04, mPDF 8.0, PHP 7.2, Perl 5.26.1, CAM-PDF 1.60.

mPDF occasionally has bugs with the Overwrite() function, as I found in their GitHub community.

Any suggestions, or another way to do bulk search & replace in PDF files?


Sasa Jovanovic
  • Can you provide a link to the PDF file? Then we will have something to test against – Håkon Hægland Oct 18 '19 at 12:39
  • Of course, there are two versions, compressed and uncompressed (uncompressed with pdftk; the filename starts with the "u_" prefix): https://devfeelbetter.wpengine.com/test/pdf/Bach-English-Suite-A-Minor-Alemande-and-Courante.pdf and https://devfeelbetter.wpengine.com/test/pdf/u_Bach-English-Suite-A-Minor-Alemande-and-Courante.pdf – Sasa Jovanovic Oct 18 '19 at 13:46
  • Thanks for the links! I tried to grep the uncompressed file for "Wagner" but it did not match – Håkon Hægland Oct 18 '19 at 14:52
  • However, [`pdf2txt.py`](https://github.com/euske/pdfminer) was able to find `Wagner`. If you look at the source code for `pdf2txt.py` you should be able to figure out how to do the replacement – Håkon Hægland Oct 18 '19 at 14:56
  • See also [Search and replace placeholder text in PDF with Python](https://stackoverflow.com/q/39712828/2173773) – Håkon Hægland Oct 18 '19 at 15:09
  • I think I found why `grep` does not work. If you look at the output from your `CAM::PDF` perl script, you can see that `© Wagner International Music Examinations` has been encoded as ... – Håkon Hægland Oct 18 '19 at 15:17
  • ... `[(©)2.3923( )-7.3008(W)0.563614(a)0.870722(g)-0.307108(n)-0.307108(e)0.868977(r)4.22623( )-7.3008(I)4.22623(n)-0.307108(t)6.40391(e)0.868977(r)-10.0683(n)-0.307108(a)0.868977(t)6.40391(i)-7.89059(o)-0.307108(n)-0.307108(a)0.868977(l)6.40391( )-7.3008(M)2.73955(u)-0.307108(s)3.05014(i)-7.89059(c)0.868977( )6.99369(E)-3.66436(x)-0.307108(a)0.868977(m)6.0968(i)-7.89059(n)-0.307108(a)0.868977(t)6.40391(i)-7.89059(o)-0.307108(n)-0.307108(s)389.002]TJ`. Note that the plain text is inside the parentheses, one glyph at a time, with kerning offsets in between. So I think you need to recalculate all the offsets in the above expression ... – Håkon Hægland Oct 18 '19 at 15:17
  • ... if you want to replace it – Håkon Hægland Oct 18 '19 at 15:20
  • More information about the TJ operator can be found in [Section 5.3.2](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf) of the PDF Reference – Håkon Hægland Oct 18 '19 at 15:35

1 Answer

Here is a hack that currently almost works for your case (I will come back later and try to improve it):

use feature qw(say);
use strict;
use warnings;
# the PDF uses a non-standard encoding so it does not help to use UTF-8
# use open qw(:std :encoding(UTF-8)); 
use utf8;
use CAM::PDF;

my $fn = 'test.pdf';  # uncompressed file..
my $save_fn = 'test2.pdf';
my $pdf = CAM::PDF->new($fn);
my $nump = $pdf->numPages();
my $match = 0;
# double quotes, so that \x{a9} becomes the single byte the PDF uses for ©
my $replace = "[(\x{a9} SoundTech International Music Examinations)]TJ";
for my $i (1..$nump) {
    my $page = $pdf->getPageContent( $i );
    # replace the text part
    if ($page =~ s/\[\(\x{a9}\).*?\]TJ/$replace/g) {
        $match = 1;
        $pdf->setPageContent($i, $page);
    }
}

if ( $match ) {
    $pdf->cleanoutput($save_fn);
    say "Save $save_fn ..";
}
else {
    say "No match";
}
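
A possible generalization, as a sketch only: instead of anchoring on the © byte, decode every `TJ` array to its plain text, do the replacement there, and write the result back as one unkerned string. The helper names `tj_text` and `replace_in_tj` are hypothetical (they are not part of CAM::PDF). It assumes the text is stored as literal strings rather than hex strings, that the search string sits inside a single `TJ` array, and that the embedded font has glyphs for the replacement letters.

```perl
use strict;
use warnings;

# A PDF literal string: ( ... ) with backslash escapes, no nested parentheses.
my $literal = qr/\((?:\\.|[^\\()])*\)/s;

# Concatenate the literal strings of one TJ array body, dropping the
# kerning numbers between them.
sub tj_text {
    my ($body) = @_;
    my $text = '';
    while ($body =~ /\(((?:\\.|[^\\()])*)\)/gs) {
        my $chunk = $1;
        $chunk =~ s/\\([()\\])/$1/g;    # unescape \( \) and \\
        $text .= $chunk;
    }
    return $text;
}

# Replace $search with $replace in every TJ array of a content stream.
# A rewritten array becomes one plain string, so its kerning is lost.
# Returns the new content and the number of arrays that were rewritten.
sub replace_in_tj {
    my ($content, $search, $replace) = @_;
    my $count = 0;
    my $rewrite = sub {
        my ($whole, $body) = @_;
        my $text = tj_text($body);
        return $whole unless $text =~ s/\Q$search\E/$replace/g;
        $count++;
        $text =~ s/([()\\])/\\$1/g;     # re-escape for a PDF literal string
        return "[($text)]TJ";
    };
    $content =~ s{(\[((?:$literal|[^\]()])*)\]\s*TJ)}{ $rewrite->($1, $2) }ge;
    return ($content, $count);
}

# Demo on a fragment shaped like the content stream from the comments.
my $stream = 'BT /F1 9 Tf [(W)0.563614(a)0.870722(g)-0.307108(n)-0.307108'
           . '(e)0.868977(r)4.22623] TJ ET';
my ($new, $n) = replace_in_tj($stream, 'Wagner', 'SoundTech');
print "$n array(s) rewritten: $new\n";
```

In the script above, the string returned by `getPageContent` could be passed through `replace_in_tj` and written back with `setPageContent` whenever the count is non-zero; wrapping that in a loop over the file names would cover the bulk case.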
Håkon Hægland
  • Great. Seems good. Do you know why a single quote appears before the © sign in the converted PDF? – Sasa Jovanovic Oct 18 '19 at 20:18
  • Seems like the PDF does not use UTF-8 or ASCII: the copyright symbol © is `\x{c2}\x{a9}` in UTF-8, but the PDF uses the single byte `\x{a9}` to represent the symbol. I have updated the answer and it seems to work now – Håkon Hægland Oct 18 '19 at 21:41
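
The byte-level difference mentioned in the last comment can be checked with Perl's core Encode module. This is only an illustration: it assumes the font encoding in this particular PDF agrees with Latin-1 at this code point, which is what the `\x{a9}` in the page content suggests.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $copyright = "\x{a9}";                        # U+00A9 COPYRIGHT SIGN

my $utf8   = encode('UTF-8',      $copyright);   # two bytes
my $latin1 = encode('ISO-8859-1', $copyright);   # one byte

# Print each encoding as a list of hex bytes.
for my $pair ([ 'UTF-8', $utf8 ], [ 'Latin-1', $latin1 ]) {
    my ($name, $bytes) = @$pair;
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    print "$name: $hex\n";
}
```

If the replacement string is written into the content stream as UTF-8, the symbol gets an extra leading byte, which a viewer may render as a stray character in front of ©.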