I have a PDF file generated by Microsoft Word. The user has specified a "highlight" color of black to make the text look like it's inside a black box (and so look like it's been redacted). I'd like to change the black boxes to yellow so that the text is highlighted instead.

Ideally, I'd like to do this in Python.

Thanks!

vy32
  • 1
    Just a clarification, there is no such thing as a "highlight" color in PDF. What you probably have is just a black rectangle being drawn below the text. – yms Feb 19 '13 at 15:35
  • Of course. The yellow box underneath the black text will highlight it. – vy32 Feb 19 '13 at 18:55
  • 1
    What I mean is that you will have a hard time identifying "highlight" rectangles from all other drawings on the page, since they will all use the same PDF drawing instructions. – yms Feb 19 '13 at 19:26
  • I would be happy to change *ALL* of the black rectangles to yellow rectangles. – vy32 Feb 19 '13 at 23:14

1 Answer

Option 1: If a commercial library is an option, you can easily implement this with Amyuni PDF Creator .Net. The C# code would look like this:

using System.IO;
using Amyuni.PDFCreator;
using System.Collections;

//open a pdf document
FileStream testfile = new FileStream("test1.pdf", FileMode.Open, FileAccess.Read, FileShare.Read);
IacDocument document = new IacDocument(null);
document.Open(testfile, "");

//get the first page
IacPage page1 = document.GetPage(1);

//get all graphic objects on the page
IacAttribute attribute = page1.AttributeByName("Objects");

// listobj is an arraylist of objects
ArrayList listobj = (ArrayList)attribute.Value;

foreach (IacObject iacObj in listobj)
{
    //if the object is a rectangle and the background color is black then set it to yellow
    if ((IacObjectType)iacObj.AttributeByName("ObjectType").Value == IacObjectType.acObjectTypeFrame && (int)iacObj.AttributeByName("BackColor").Value == 0)
    {
        iacObj.AttributeByName("BackColor").Value = 0x00FFFF; //Yellow
    }
}

I suppose you could translate this to IronPython instead.
Usual disclaimer applies for this suggestion

Option 2: If a commercial library is not an option and you are not developing a commercial closed-source application, you could try a bit of unreliable hacking on the page content using iText:

You can try decoding the page content (see the ContentByteUtils class in iText for details), inserting a color selection operator before every fill operator, and then resaving the file. For more details on these operators see TABLE 4.10 Path-painting operators of the Adobe PDF reference document.

Operator f: fills the path, using the nonzero winding number rule to determine the region to fill (see “Nonzero Winding Number Rule” on page 232).

Operator rg: sets the nonstroking color space to DeviceRGB, and sets the nonstroking color to the specified value.

Operator q: saves the current graphics state.

Operator Q: restores the saved graphics state.

So if you have a sequence of operators on your page:

0.0 0.0 0.0 rg % Set nonstroking color to black
25 175 175 -150 re % Construct rectangular path
f % Fill path

It should become:

0.0 0.0 0.0 rg % Set nonstroking color to black
25 175 175 -150 re % Construct rectangular path
q % Saves the current graphic state
1.0 1.0 0.0 rg % Set nonstroking color to yellow
f % Fill path
Q % Restores the saved graphic state
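
Since the question asks for Python, the insertion shown above can be sketched as a plain string transformation on an already decoded content stream. This is only a minimal sketch: `wrap_fills` and `strip_comments` are hypothetical names, the tokenizer is a naive whitespace split that will misfire on an `f` inside a text string or inline image, and only `f`, `F` and `f*` are treated as fill operators. Some PDF consumers may also be strict about which operators appear between path construction and painting, so treat the output as experimental.

```python
# Fill operators from the path-painting table: nonzero fill (f, legacy F)
# and even-odd fill (f*). Combined fill-and-stroke operators are ignored here.
FILL_OPERATORS = {"f", "F", "f*"}


def strip_comments(content: str) -> str:
    """Drop '% ...' comments line by line (naive: ignores '%' inside strings)."""
    return "\n".join(line.split("%", 1)[0] for line in content.splitlines())


def wrap_fills(content: str, rgb=(1.0, 1.0, 0.0)) -> str:
    """Wrap every fill operator as 'q <r> <g> <b> rg <fill> Q'.

    Assumption: `content` is the decoded (uncompressed) page content as text,
    and whitespace-separated tokens are good enough to find the operators.
    """
    color = [f"{component:g}" for component in rgb]
    out = []
    for token in strip_comments(content).split():
        if token in FILL_OPERATORS:
            # Save state, select the new nonstroking color, fill, restore state.
            out.extend(["q", *color, "rg", token, "Q"])
        else:
            out.append(token)
    return " ".join(out)
```

Calling `wrap_fills` on the three-line sequence above yields the same operators as the six-line sequence, joined by single spaces instead of newlines.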

Some remarks:
- This approach will turn yellow every non-text drawing that is painted with a fill operator (lines, curves, etc., but not raster images), regardless of its original color. It will also turn yellow any text that is drawn with the same path operators as other PDF drawings.
- Form XObjects and annotations used on the page will not be processed.
- If the documents you will process are produced by the same tool in the same way, you can test a few files and see how it goes.
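
One way to soften the first remark is to track the current nonstroking color while scanning and recolor a fill only when that color is black. The sketch below is a variation on the idea, not the exact sequence shown earlier: it places `q` and `rg` before the path rather than directly before `f`, because the PDF specification only expects path construction operators between the start of a path and its painting operator. The names `recolor_black_fills` and `_is_number` are hypothetical, the tokenizer is a naive whitespace split (so an `f` inside a text string will confuse it), comments are assumed to be absent, and any color set through `k`, `cs`, `sc` or `scn` is treated as unknown and left alone.

```python
PATH_OPERATORS = {"m", "l", "c", "v", "y", "h", "re", "W", "W*"}
FILL_OPERATORS = {"f", "F", "f*"}
BLACK = (0.0, 0.0, 0.0)


def _is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def recolor_black_fills(content: str, rgb=(1.0, 1.0, 0.0)) -> str:
    """Wrap only the paths that are filled while the fill color is black."""
    new_color = [f"{component:g}" for component in rgb]
    out, operands, path = [], [], []
    color = BLACK   # the initial nonstroking color of a page is black
    saved = []      # colors saved by q, restored by Q
    for token in content.split():
        if _is_number(token):
            operands.append(token)
            continue
        group, operands = operands + [token], []
        if token in PATH_OPERATORS:
            path.extend(group)      # keep buffering the path under construction
            continue
        if token in FILL_OPERATORS:
            if color == BLACK:
                out.extend(["q", *new_color, "rg", *path, token, "Q"])
            else:
                out.extend(path + [token])
            path = []
            continue
        out.extend(path)            # flush a path ended by another operator
        path = []
        if token == "q":
            saved.append(color)
        elif token == "Q" and saved:
            color = saved.pop()
        elif token == "rg" and len(group) == 4:
            color = tuple(float(v) for v in group[:3])
        elif token == "g" and len(group) == 2:
            color = (float(group[0]),) * 3
        elif token in {"k", "cs", "sc", "scn"}:
            color = None            # unknown color: never treated as black
        out.extend(group)
    return " ".join(out + path + operands)
```

With this version a red rectangle that follows a black one is left untouched, and black text is not affected because text-showing operators are never wrapped.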

Important: This is just an untested idea off the top of my head; it may work, or it may not.

yms
  • Great. Thanks. Now I need to figure out how to do this with Python. – vy32 Feb 25 '13 at 21:54
  • You could try with pyPdf or PDFMiner, but I am not sure if they will allow you to modify the page content. – yms Feb 25 '13 at 21:56