I have a PowerShell script that uses iTextSharp to extract text from PDF files. One of the files the script downloads comes in sideways, so it needs to be rotated before the script can read it.

Here's my function which reads the PDF. I've tested it and it works:

function Get-PdfText {
    [CmdletBinding()]
    [OutputType([string])]
    param (
        [Parameter(Mandatory = $true)]
        [string]
        $Path
    )

    try {
        $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $Path
    }
    catch {
        throw
    }

    $stringBuilder = New-Object System.Text.StringBuilder

    for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
        $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page)
        $null = $stringBuilder.AppendLine($text) 
    }

    $reader.Close()

    return $stringBuilder.ToString()
}

There is plenty of documentation about how to rotate PDFs in C# and Java, but not in PowerShell. There's a nice example here, but I don't know how to convert it to PowerShell: http://developers.itextpdf.com/question/how-rotate-page-90-degrees

Here's my attempt at converting it:

function RotatePdf90Degrees {
    param (
        [Parameter(Mandatory = $true)]
        [string]
        $Path
    )

    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $Path
    $n = $reader.NumberOfPages
    $page #PdfDictionary
    $rotate #PdfNumber
    for ($p = 1; $p -le $n; $p++) {
        $page = $reader.GetPageN($p);
        $rotate = $page.GetAsNumber([iTextSharp.text.pdf.PdfName]::ROTATE);
        if ($rotate -eq $null) {
            $page.put([iTextSharp.text.pdf.PdfName]::ROTATE, [iTextSharp.text.pdf]::PdfNumber(90));
        }
        else {
            $page.put([iTextSharp.text.pdf.PdfName]::ROTATE, [iTextSharp.text.pdf]::PdfNumber(($rotate.IntValue() + 90) % 360));
        }
    }

    $stamper = New-Object iTextSharp.text.pdf.PdfStamper ($reader, [System.IO.StreamWriter] $Path);
    $stamper.Close();
    $reader.Close();
}

Something is wrong on the $page.put() lines. I don't know how to feed that function a proper PdfNumber object.

I've been using this documentation: http://developers.itextpdf.com/reference/package/com.itextpdf.text.pdf

Fungusface
  • Try throwing a `New-Object` in there like `New-Object [iTextSharp.text.pdf]::PdfNumber(90)` – Chris Haas Mar 14 '16 at 17:21
  • The script won't compile with that. "Unexpected token 'New-Object' in expression or statement." – Fungusface Mar 14 '16 at 17:24
  • Sorry, PowerShell is very c#-like but not exactly. `PdfNumber` is an object so you need to `new` it somehow but I don't know if you can do it inline. How about `New-Object iTextSharp.text.pdf.PdfNumber(90)`? If that doesn't work, try setting that to a variable and then passing that variable into the `put` method. – Chris Haas Mar 14 '16 at 17:27
  • Looks like you just can't do it inline. I've now got it where I'm creating the object and putting it in a variable, and it's not complaining about that. Something still isn't working though. I think it might be an issue with the `$stamper` at the bottom. When I run the debugger, it looks like `$stamper` is still null after it should be initialized. The PdfStamper constructor calls for an OutputStream. Am I providing that correctly? I want it to overwrite the PDF file after it's rotated. – Fungusface Mar 14 '16 at 17:49
  • You can't stamp to the same file/stream source as the reader. To get around that, you need to either stamp to a new file and then rename it to the old file, or pass the `PdfReader` constructor a byte array so that it [reads the entire source into memory](http://stackoverflow.com/a/15638899/231316), using `[System.IO.File]::ReadAllBytes($path)`. Unfortunately I don't have time to test and format this for you, but hopefully it gets you on the right path. – Chris Haas Mar 14 '16 at 18:22
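
The two suggestions in these comments (build the `PdfNumber` as a real object, and read the source as a byte array so the same file can be overwritten) can be combined roughly as follows. This is an untested sketch, not a verified fix: it assumes iTextSharp 5.x is already loaded, that `PdfNumber` exposes `IntValue` as a property in the .NET port, and that closing the stamper also closes the output stream. `Get-NextRotation` and `Add-PdfRotation` are hypothetical names introduced for illustration.

```powershell
# Assumes iTextSharp 5.x is already loaded, e.g. via Add-Type -Path <path to itextsharp.dll>.

# Hypothetical helper: new /Rotate value, normalized to the range 0..359.
function Get-NextRotation {
    param (
        [int] $Current = 0,
        [int] $Delta = 90
    )
    return ((($Current + $Delta) % 360) + 360) % 360
}

# Hypothetical function: adds $Degrees to every page's rotation and overwrites $Path.
# Pass a full path; .NET file APIs resolve relative paths against the process
# working directory, which may differ from the PowerShell location.
function Add-PdfRotation {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $Degrees = 90
    )

    # Read the whole file into memory so the reader holds no handle on $Path.
    $bytes = [System.IO.File]::ReadAllBytes($Path)
    # The unary comma stops PowerShell from unrolling the byte array into separate arguments.
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList (,$bytes)
    # Built from the string 'Rotate' instead of using the static PdfName field.
    $rotateName = New-Object iTextSharp.text.pdf.PdfName -ArgumentList 'Rotate'

    try {
        for ($p = 1; $p -le $reader.NumberOfPages; $p++) {
            $page = $reader.GetPageN($p)
            $existing = $page.GetAsNumber($rotateName)
            $current = 0
            # Assumption: IntValue is a property here, not a method as in Java.
            if ($null -ne $existing) { $current = $existing.IntValue }
            [int] $next = Get-NextRotation -Current $current -Delta $Degrees
            # Wrapping New-Object in parentheses lets it run inline as a method argument.
            $page.Put($rotateName, (New-Object iTextSharp.text.pdf.PdfNumber -ArgumentList $next))
        }

        # Safe to reopen $Path for writing because the reader works from memory.
        $stream = [System.IO.File]::Create($Path)
        $stamper = New-Object iTextSharp.text.pdf.PdfStamper -ArgumentList $reader, $stream
        $stamper.Close()   # writes the modified document to $stream
    }
    finally {
        $reader.Close()
    }
}
```

The rotation arithmetic is kept in its own small function so it can be checked without a PDF at hand; for example, `Get-NextRotation -Current 270 -Delta 90` wraps around to 0.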

1 Answer

Maybe we're working off different versions of PowerShell, but the first problem I'm having with your sample function is here:

[iTextSharp.text.pdf.PdfName]::ROTATE;

which throws the following exception:

The field or property: "ca" for type: "iTextSharp.text.pdf.PdfName" differs only in letter casing from the field or property: "CA". The type must be Common Language Specification (CLS) compliant.

Looking at the iTextSharp source code, there are two separate fields as noted in the exception:

  • PdfName.CA
  • PdfName.ca

I haven't written any PowerShell in a while, so the simplest workaround was to instantiate a new PdfName object with the same string used for PdfName.ROTATE in the source. Anyway, hopefully the following gets you started:

function Rotate-Pdf {
    [CmdletBinding()]
    param(
        [parameter(Mandatory=$true)] [string]$readerPath
        ,[parameter(Mandatory=$true)] [float]$degrees
    )
    $reader = New-Object iTextSharp.text.pdf.PdfReader($readerPath);
    $rotate = New-Object iTextSharp.text.pdf.PdfName('Rotate');
    $pdfNumber = New-Object iTextSharp.text.pdf.PdfNumber($degrees);
    $pageCount = $reader.NumberOfPages;
    for ($i = 1; $i -le $pageCount; $i++) {
        # $rotation = $reader.GetPageRotation($i);
        $pageDict = $reader.GetPageN($i);
        $pageDict.Put($rotate, $pdfNumber);
    }
    $memoryStream = New-Object System.IO.MemoryStream;
    $stamper = New-Object iTextSharp.text.pdf.PdfStamper($reader, $memoryStream);
    $stamper.Dispose();
    $bytes = $memoryStream.ToArray();
    $memoryStream.Dispose();
    $reader.Dispose();
    return $bytes;
}
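# $inputPath and $outputPath are placeholders for your own file paths.
# ($input is a PowerShell automatic variable, so it is avoided as a name here.)
$bytes = Rotate-Pdf $inputPath 90;
[System.IO.File]::WriteAllBytes($outputPath, $bytes);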

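Note that there's an extra parameter for the degrees to rotate, and that I have commented out the call to $reader.GetPageRotation(). As written, the function sets each page's Rotate entry to the given value instead of adding to any existing rotation. Depending on how a PDF was created, you cannot always count on PdfReader.GetPageRotation().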
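
To tie this back to the original goal of reading the text, the rotated bytes can be saved to a separate file and then passed to the question's Get-PdfText function. The sketch below is untested and assumes that Rotate-Pdf and Get-PdfText from this page are already defined and that iTextSharp is loaded. `Get-RotatedPath` and `Get-RotatedPdfText` are hypothetical names, and whether the extractor actually reads the rotated copy correctly is something to verify against the real file.

```powershell
# Assumes Rotate-Pdf and Get-PdfText from this page are already defined
# and that the iTextSharp assembly is loaded.

# Hypothetical helper: 'report.pdf' becomes 'report.rotated.pdf'.
function Get-RotatedPath {
    param ([Parameter(Mandatory = $true)] [string] $Path)
    return [System.IO.Path]::ChangeExtension($Path, '.rotated.pdf')
}

# Hypothetical wrapper: rotate, save a copy next to the source, then extract text.
function Get-RotatedPdfText {
    param (
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $Degrees = 90
    )
    $rotatedPath = Get-RotatedPath -Path $Path
    # The cast turns the function's unrolled pipeline output back into a byte array.
    [byte[]] $bytes = Rotate-Pdf $Path $Degrees
    [System.IO.File]::WriteAllBytes($rotatedPath, $bytes)
    return Get-PdfText -Path $rotatedPath
}
```

Writing to a second file leaves the downloaded original untouched, which makes it easy to retry with a different angle if 90 degrees turns out to be the wrong direction.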

UPDATE:

I confirmed that the exception noted above is specific to PowerShell 4.0. I didn't test V3.0, but with V2.0, [iTextSharp.text.pdf.PdfName]::ROTATE does not throw an ExtendedTypeSystemException and runs without issue.

kuujinbo