
I'm using CAM::PDF with Perl to delete or replace some text in single-page PDF files.

use CAM::PDF;

my $repl_str = "redacted";

my $pdf = CAM::PDF->new($file_name) or die("Couldn't read PDF $file_name: $CAM::PDF::errstr");
my $content = $pdf->getPageContent(1);
my $text = $pdf->getPageText(1);

my @del_lines;

my @lines = split (/\n/, $text); # Splits the page text into the array @lines

# Collect every line that contains text to be removed
foreach my $l (@lines) {
  if ($l =~ /sometexttotemove/ || $l =~ /othertexttoremove/) {
    push @del_lines, $l;
  }
}

for (@del_lines) {
  s/([\(\)])/\\$1/g; # in PDF, parens are pre-escaped so they need an extra backslash
  my $m = quotemeta; # treat the line as literal text, not as a regex
  $content =~ s/$m/$repl_str/;
}

$pdf->setPageContent(1, $content);
$pdf->cleanoutput($outfile) or die("Couldn't write ${mrn}_${dt}.pdf: $CAM::PDF::errstr");

This works 99.9% of the time, but very rarely I get this error and the script terminates:

Expected string closing
250  >>...

If I look at the $content variable after it's extracted and before any manipulation, I get this:

BT 450 20430 Td (sometexttotemove) Tj ET
BT 7750 20430 Td (othertexttoremove) Tj ET

and after replacing the strings in $content I get this:

BT 450 20430 Td () Tj ET
BT 7750 20430 Td () Tj ET

This is the same whether or not the script crashes.

Can someone explain why this error is happening?

This initially seemed to happen when there were parentheses in sometexttotemove, but handling the escape characters does not fix the problem.
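For reference, here is a minimal sketch of the escaping involved, assuming the text ends up inside a PDF literal string of the form ( ... ), where backslashes and parentheses are special. The helper name escape_pdf_literal is hypothetical and not part of CAM::PDF, and it is not confirmed that missing escapes are what triggers the error.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): escape plain text so it can be
# placed inside a PDF literal string "( ... )".
sub escape_pdf_literal {
    my ($s) = @_;
    # One pass over the string: each backslash or parenthesis gets a single
    # leading backslash, so nothing is escaped twice.
    $s =~ s/([\\()])/\\$1/g;
    return $s;
}

# Example: parentheses and a backslash in the text to be matched
print escape_pdf_literal('name (old) a\b'), "\n";
```

The escaped result would then go through quotemeta before being used as a pattern against the page content, as in the loop above.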

I found where the error is raised in the CAM::PDF source code at https://github.com/gitpan/CAM-PDF/blob/master/lib/CAM/PDF.pm, but I still don't understand why it happens only with some PDFs.

HPW
  • Maybe you should provide such a PDF file along with specific _sometexttotemove_. – Armali May 10 '23 at 06:09
  • Didn't even provide the whole error message?! – ikegami May 10 '23 at 13:13
  • Does "sometexttotemove" or "othertexttoremove" have a backslash or a parenthesis character in them? If so, PDF insists that they must be escaped or balanced. – Chris Dolan May 11 '23 at 11:17
  • Hi. I can't provide a PDF because it contains confidential information. I've tried with and without escape characters and it does not help. – HPW May 12 '23 at 19:54
  • That is actually the entire error message...... – HPW May 12 '23 at 19:54
