PDF Add Text and Flatten

Question

I'm developing a web application that displays PDFs and allows users to order copies of the documents. We want to add text, such as "unpaid" or "sample", on the fly when the PDF is displayed. I have accomplished this using iTextSharp. However, the page images are easily separated from the watermark text and extracted using a variety of freeware programs.

How can I add the watermark to the pages in the PDF, but flatten the page images and watermark together so that the watermark becomes part of the PDF page image, thereby preventing the watermark from being removed (unless the person wants to use Photoshop)?

score 2 · Accepted Answer · edited May 23 '17 at 12:27

If I were you I would go down a different path. Using iTextSharp (or another library) extract each page of a given document to a folder. Then use some program (Ghostscript, Photoshop, maybe GIMP) that you can batch and convert each page to an image. Then write your overlay text onto the images. Finally use iTextSharp to combine all of the images in each folder back into a PDF.

I know this sounds like a pain, but I assume you would only have to do this once per document.
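
As a rough sketch of the "write your overlay text onto the images" step, assuming each page has already been converted to an image and you have its bytes in memory, something like the following would burn the text into the pixels using only System.Drawing. The module name WatermarkSketch, the function StampImage, and the font size, color, and angle are all made up for illustration; adjust them to taste.

```vbnet
Option Explicit On
Option Strict On

Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO

Public Module WatermarkSketch

    ''//Hypothetical helper: draws the text onto the page image itself and
    ''//returns the combined result as JPEG bytes
    Public Function StampImage(ByVal pageImageBytes() As Byte, ByVal watermarkText As String) As Byte()
        Using input As New MemoryStream(pageImageBytes)
            Using source As Image = Image.FromStream(input)
                ''//Copy to a 24bpp bitmap because Graphics cannot draw on indexed (e.g. 1bpp) images
                Using page As New Bitmap(source.Width, source.Height, PixelFormat.Format24bppRgb)
                    page.SetResolution(source.HorizontalResolution, source.VerticalResolution)
                    Using g As Graphics = Graphics.FromImage(page)
                        g.DrawImage(source, 0, 0, source.Width, source.Height)
                        g.TextRenderingHint = System.Drawing.Text.TextRenderingHint.AntiAlias

                        ''//Assumed styling: semi-transparent red, sized relative to the page width
                        Dim fontSize As Single = CSng(source.Width) / 8.0F
                        Using f As New Font(FontFamily.GenericSansSerif, fontSize, FontStyle.Bold, GraphicsUnit.Pixel)
                            Using b As New SolidBrush(Color.FromArgb(96, Color.Red))
                                Using fmt As New StringFormat()
                                    fmt.Alignment = StringAlignment.Center
                                    fmt.LineAlignment = StringAlignment.Center
                                    ''//Move the origin to the page center and rotate so the text runs diagonally
                                    g.TranslateTransform(CSng(source.Width) / 2.0F, CSng(source.Height) / 2.0F)
                                    g.RotateTransform(-45.0F)
                                    g.DrawString(watermarkText, f, b, 0.0F, 0.0F, fmt)
                                End Using
                            End Using
                        End Using
                    End Using

                    ''//Re-encode; the watermark is now ordinary pixel data in the image
                    Using output As New MemoryStream()
                        page.Save(output, ImageFormat.Jpeg)
                        Return output.ToArray()
                    End Using
                End Using
            End Using
        End Using
    End Function

End Module
```

Because the text ends up in the same pixel data as the scan, extracting the image from the resulting PDF gives back the watermarked version.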

If you don't want to go down this route, here is what you need to do to extract the images. Much of the code below comes from this post. At the end of the code I'm saving the images to the desktop. Since you've got the raw bytes, you could also easily pump those into a System.Drawing.Image object and write them back into a new PdfWriter object, which it sounds like you are familiar with. Below is a full working WinForms app targeting iTextSharp 5.1.1.0.

Option Explicit On
Option Strict On

Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports System.IO
Imports System.Runtime.InteropServices

Public Class Form1

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        ''//File to process
        Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "SampleImage.pdf")

        ''//Bind a reader to our PDF
        Dim R As New PdfReader(InputFile)

        ''//Set up some variables to use below
        Dim bytes() As Byte
        Dim obj As PdfObject
        Dim pd As PdfDictionary
        Dim filter, width, height, bpp As String
        Dim pixelFormat As System.Drawing.Imaging.PixelFormat
        Dim bmp As System.Drawing.Bitmap
        Dim bmd As System.Drawing.Imaging.BitmapData

        ''//Loop through all of the references in the file
        Dim xo = R.XrefSize
        For I = 0 To xo - 1
            ''//Get the object
            obj = R.GetPdfObject(I)
            ''//Make sure we have something and that it is a stream
            If (obj IsNot Nothing) AndAlso obj.IsStream() Then
                ''//Cast it to a dictionary object
                pd = DirectCast(obj, PdfDictionary)
                ''//See if it has a subtype property that is set to /IMAGE
                If pd.Contains(PdfName.SUBTYPE) AndAlso pd.Get(PdfName.SUBTYPE).ToString() = PdfName.IMAGE.ToString() Then
                    ''//Grab various properties of the image
                    filter = pd.Get(PdfName.FILTER).ToString()
                    width = pd.Get(PdfName.WIDTH).ToString()
                    height = pd.Get(PdfName.HEIGHT).ToString()
                    bpp = pd.Get(PdfName.BITSPERCOMPONENT).ToString()

                    ''//Grab the raw bytes of the image
                    bytes = PdfReader.GetStreamBytesRaw(DirectCast(obj, PRStream))

                    ''//Images can be encoded in various ways. /DCTDECODE is the simplest because it's essentially JPEG and can be treated as such.
                    ''//If your PDFs contain the other types you will need to figure out how to handle those on your own
                    Select Case filter
                        Case PdfName.ASCII85DECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.ASCIIHEXDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.FLATEDECODE.ToString()
                            ''//This code from https://stackoverflow.com/questions/802269/itextsharp-extract-images/1220959#1220959
                            bytes = pdf.PdfReader.FlateDecode(bytes, True)
                            Select Case Integer.Parse(bpp)
                                Case 1
                                    pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
                                Case 24
                                    pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
                                Case Else
                                    Throw New Exception("Unknown pixel format " + bpp)
                            End Select
                            bmp = New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
                            bmd = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
                            Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
                            bmp.UnlockBits(bmd)
                            Using ms As New MemoryStream
                                bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg)
                                ''//ToArray returns only the bytes written; GetBuffer would include unused buffer capacity
                                bytes = ms.ToArray()
                            End Using
                        Case PdfName.LZWDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.RUNLENGTHDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.DCTDECODE.ToString()
                            ''//Bytes should be raw JPEG so they should not need to be decoded, hopefully
                        Case PdfName.CCITTFAXDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.JBIG2DECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.JPXDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case Else
                            Throw New ApplicationException("Unknown filter found : " & filter)
                    End Select

                    ''//At this point the byte array should contain valid JPEG data, write it to disk
                    My.Computer.FileSystem.WriteAllBytes(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), I & ".jpg"), bytes, False)
                End If
            End If

        Next

        Me.Close()
    End Sub
End Class
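
For the last step mentioned above, writing the images back into a new PDF, here is a minimal sketch. It assumes you already have one image byte array per page (for example JPEG bytes with the watermark drawn in). The names RebuildSketch and ImagesToPdf are hypothetical, and each page is simply sized to its image, treating one pixel as one point, which may not match your original page dimensions.

```vbnet
Option Explicit On
Option Strict On

Imports System.Collections.Generic
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Public Module RebuildSketch

    ''//Hypothetical helper: writes one page per image, each page sized to its image
    Public Sub ImagesToPdf(ByVal pageImages As IEnumerable(Of Byte()), ByVal outputFile As String)
        Dim doc As New Document()
        Using fs As New FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None)
            PdfWriter.GetInstance(doc, fs)
            Dim opened As Boolean = False

            For Each imageBytes As Byte() In pageImages
                Dim img As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imageBytes)

                ''//The page size must be set before the page is started
                doc.SetPageSize(New iTextSharp.text.Rectangle(img.ScaledWidth, img.ScaledHeight))
                If opened Then
                    doc.NewPage()
                Else
                    doc.Open()
                    opened = True
                End If

                ''//Pin the image to the lower-left corner so it covers the whole page
                img.SetAbsolutePosition(0.0F, 0.0F)
                doc.Add(img)
            Next

            ''//Only close a document that was actually opened (i.e. at least one image was supplied)
            If opened Then
                doc.Close()
            End If
        End Using
    End Sub

End Module
```

If the original page sizes matter, you would want to read them from the source PDF and scale each image to fit instead of sizing the page to the image.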

score 1 · Answer 2 · answered Sep 22 '11 at 18:42

The whole page would have to be rendered as an image. Otherwise you've got "text objects" (the individual words/letters of the text), and the watermark object (the overlay image), which will always be distinct/separate parts of the page.

Marc B

There are no text objects because the documents were scanned. The entire page is an image. In fact, the watermark is a text object. However, if I can make the watermark an image object, how can I flatten the watermark image and the page image together into one image? – DCNYAM Sep 22 '11 at 18:59
Programmatically, you'd have to extract the image of the page, merge it with the watermark, then replace the original page image with this new one. Be aware that some scanners will do OCR on the text and embed that in the PDF as well, which would bypass this whole watermarking business. – Marc B Sep 22 '11 at 19:00
Any tips on how to extract the page images and replace them? I know about the OCR software, but they chose to scan the documents as images without OCR, and they've already scanned a few hundred thousand. – DCNYAM Sep 22 '11 at 19:05
No idea on ASP.NET. Most PDF manipulation I've done has been with pdflib (http://pdflib.com), which is pricey but full-featured. There are Windows versions available, and it lets you manipulate pretty much everything in the PDF. – Marc B Sep 22 '11 at 19:06
