6

Adobe Acrobat has the ability to redact PDF files (that is, actually remove the information, rather than simply drawing a black box on top of it). I would like to use this feature programmatically. To redact using the GUI you select the Mark for Redaction Tool, draw it over the text to be redacted, then Apply Redactions.

Is there any way to do this programmatically, either through AppleScript or some other way?

I know the (x, y) location of the text to be redacted.

Thanks!

vy32
  • Acrobat's scripting hooks are all funneled into Javascript, so I'm submitting that tag be added as well. I would also argue that redaction would be better handled by the system that is generating the PDFs in the first place. – Philip Regan May 18 '11 at 14:45
  • 2
    Be very careful about how you do redaction and what you consider to be redacted. Redaction that is not done properly (i.e., drawing a black box over existing text objects) is trivial to undo and still leaves the text in the document, where it can be searched and extracted. Real redaction in PDF is non-trivial. – plinth May 18 '11 at 14:52
  • 1
    Yes, it is non-trivial, which is why I want to use software that does this. Acrobat has redaction functionality built in. – vy32 May 23 '11 at 02:57

5 Answers

6

In order to properly redact a PDF, you need to Alter The Content Stream. This is Very Hard.

If you can find the portion of the content stream that draws the text you want removed, you're halfway there.

The other half is figuring out how to change the content stream such that you don't modify the rest of the document. If the next text draw operator is preceded by a "Tm" operator (set the text matrix, which absolutely positions the next piece of text), it's easy. If not... you have to calculate the exact width of the text you're removing (several different PDF libraries can do this), and alter the drawing commands to skip over that much space.

For Example:

BT
/F1 10 Tf
1 0 0 1 30 720 Tm
(Here's some text, and you only want to REDACT that upper case "redact" over there)Tj
T*
(This text is positioned relative to the previous line)Tj
1 0 0 1 30 650 Tm
(This text is positioned absolutely, starting at 30, 650)Tj
ET

So you'd have to break up that first (...)Tj line into (Here's some text, and you only want to)Tj, N 0 Td, and (that upper case "redact" over there)Tj... where the 'N' properly adjusts the position of the following text drawing operation such that it lands in EXACTLY THE SAME SPOT. So you'd need to know the precise width of " REDACT " using the font resource /F1 (whatever that turned out to be), sized to 10 points.
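As a rough sketch of that width calculation (Python, not tied to any PDF library): it assumes /F1 is standard Helvetica, that no character spacing, word spacing, or horizontal scaling is in effect, and that the text is a simple single-byte string. The function names are made up for illustration. It writes the gap as a TJ array adjustment rather than N 0 Td, the variant suggested in the comments on this answer, because TJ numbers are given in thousandths of a text space unit and so equal the summed glyph widths regardless of font size.

```python
# Assumption: /F1 is standard Helvetica. These are its AFM widths in 1/1000 em,
# listed only for the characters needed here; a real tool would read the
# widths from the font resource in the PDF.
HELVETICA_WIDTHS = {" ": 278, "A": 667, "C": 722, "D": 722, "E": 667, "R": 722, "T": 611}


def glyph_units(text, widths):
    """Sum of glyph widths in 1/1000 em (ignores Tc, Tw and Tz)."""
    return sum(widths[ch] for ch in text)


def width_in_points(text, widths, font_size):
    """Width of the text in unscaled text space units at the given font size."""
    return glyph_units(text, widths) * font_size / 1000.0


def pdf_literal(text):
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def redact_tj(line, secret, widths):
    """Replace one (...)Tj string with a TJ array that skips over `secret`."""
    before, found, after = line.partition(secret)
    if not found:
        raise ValueError("text to redact not found in this string")
    gap = glyph_units(secret, widths)
    # A negative TJ number moves the next glyph to the right by gap/1000 em.
    return "[{} {} {}]TJ".format(pdf_literal(before), -gap, pdf_literal(after))


if __name__ == "__main__":
    line = 'Here\'s some text, and you only want to REDACT that upper case "redact" over there'
    print(width_in_points(" REDACT ", HELVETICA_WIDTHS, 10))
    print(redact_tj(line, " REDACT ", HELVETICA_WIDTHS))
```

The removed bytes are gone from the stream entirely; only the numeric gap remains, so the following text should land where it did before as long as the assumptions above hold.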

Just to make your life more exciting, you have to worry about kerned text too. You can provide little spacing adjustments inline with text thusly:

(This is taken from the first text drawn in the PDF Spec)

[(Adobe Sys)5(t)1(ems Inc)5(orporated)5( 20)5(08 \226 All rights)5( reser)-9(ved)]TJ

To properly redact "Incorporated", you need to determine that it's been split across two strings, and adjust the positioning of the string following it so it's in Exactly The Same Spot.
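To illustrate the "split across two strings" problem, here is a minimal Python sketch that tokenizes a TJ array and reports which string elements a target word touches. It is only a sketch under assumptions: a simple single-byte font where string bytes correspond to characters, no hex strings, and no unescaped nested parentheses inside strings. `parse_tj` and `segments_containing` are hypothetical helper names.

```python
import re

# Matches either a literal string "(...)" (with backslash escapes) or a number.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|-?\d+(?:\.\d+)?")


def parse_tj(array_body):
    """Split the body of a TJ array into strings (raw, escapes kept) and numbers."""
    items = []
    for tok in TOKEN.findall(array_body):
        if tok.startswith("("):
            items.append(tok[1:-1])
        else:
            items.append(float(tok))
    return items


def segments_containing(items, target):
    """Return the indexes of the string items that overlap `target`."""
    strings = [(i, s) for i, s in enumerate(items) if isinstance(s, str)]
    joined = "".join(s for _, s in strings)
    start = joined.find(target)
    if start < 0:
        return []
    end = start + len(target)
    hits, pos = [], 0
    for i, s in strings:
        # Keep every string whose character range intersects [start, end).
        if pos < end and pos + len(s) > start:
            hits.append(i)
        pos += len(s)
    return hits


if __name__ == "__main__":
    body = r"(Adobe Sys)5(t)1(ems Inc)5(orporated)5( 20)5(08 \226 All rights)5( reser)-9(ved)"
    items = parse_tj(body)
    for i in segments_containing(items, "Incorporated"):
        print(i, repr(items[i]))
```

Every string the search reports has to be trimmed, and the kerning numbers between them have to be folded into the replacement gap along with the glyph widths.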

And strings can be <DEADBEEF> hex values rather than (plain old ascii).
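For completeness, a small Python sketch of turning a hex string token back into bytes before searching it. It follows the PDF rule that whitespace inside the brackets is ignored and a missing final digit counts as 0; `decode_hex_string` is a hypothetical helper name, and mapping the resulting bytes to readable text still depends on the font's encoding.

```python
def decode_hex_string(token):
    """Decode a PDF hex string token such as <DEADBEEF> into raw bytes."""
    # Drop the angle brackets and any whitespace between the digits.
    digits = "".join(token.strip()[1:-1].split())
    # An odd number of digits means the last one is padded with 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


if __name__ == "__main__":
    print(decode_hex_string("<DEADBEEF>"))
    print(decode_hex_string("<48 65 6C 6C 6F>"))
```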

Get the idea? And I haven't covered all the possibilities here, just the most common ones.

Like I said: This is Very Hard.


There's an Acrobat plugin called Appligent Redax (no connection) that lets you draw annotations (or generate them via templates, regex, etc.) and then run their code to handle the redaction. It should be possible to programmatically create their annotations and perhaps even activate their plugin: JS in a document can run a menu item.

Mark Storer
  • RE: "alter the drawing commands to skip over that much stuff"... Wouldn't that then still leave the redacted content within the PDF, still available for later extraction? I'm curious because PDFs have become the _lingua franca_ of eBook companies, and sometimes text needs to be removed due to copyright reasons. Going back to the page layouts is a drag, but we have to be sure the content can't be reused, even accidentally. – Philip Regan May 19 '11 at 10:43
  • I'm saying you have to *replace* some of the content stream with Something Else. `(Redact me, but not me)Tj` with `N 0 Td (, but not me)Tj`, where N is the correct width in points (almost certainly a floating point value). After you've properly removed all the redacted text, you then add your black boxes (which is quite trivial by comparison). – Mark Storer May 19 '11 at 17:03
  • PS: There's a couple different products out there that do redaction. One might be programmable. Appligent's Redax doesn't look it... but you could probably generate their annotations yourself once you know their format. – Mark Storer May 20 '11 at 16:13
  • 2
    I don't recommend N 0 Td, because the Td operator has the side effect of changing the start position of the current line, so operators like T* would no longer function correctly (I'm speaking from experience). The correct way of implementing text redaction is `[-N]TJ`, instead of N 0 Td. Otherwise you are correct, it's very hard to do. With Type0 fonts the text is not even human readable, because glyph IDs are used, which is not a meaningful encoding. If you're lucky, the ToUnicode CMap can be used to decode the text content. In other cases the logical structure information may help. – Tamas Demjen May 23 '11 at 09:52
  • 1
    I'd add that the N value in [-N]TJ must be 1000 times larger than the actual dimension, because TJ adjustments are expressed in thousandths of a unit of text space. – Tamas Demjen May 23 '11 at 10:30
  • Good catch. That hadn't occurred to me. I've never needed to Actually Perform Redaction. Another technique would be to determine the absolute position of all text and replace all relative operators with 'Tm', which would remove the chance of such things happening, at the cost of a larger content stream. – Mark Storer May 23 '11 at 16:15
  • Precisely what I was looking for. Thanks! – vy32 Jul 01 '13 at 02:14
2

You can use GroupDocs.Redaction for .NET to programmatically redact text in PDF documents. It supports exact-phrase, case-sensitive, and regular-expression redaction. This is how you can perform an exact-phrase redaction:

using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
     doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
     // Save the document to "*_Redacted.*" file.
     doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false }); 
} 

Disclosure: I work as Developer Evangelist at GroupDocs.

Usman Aziz
  • Wow. That's great. I had no idea! Is there a way to run this on the Mac or Linux, perhaps under Mono? – vy32 May 26 '19 at 14:02
  • 1
    @vy32, At the moment, the API doesn't support Mono. However, we are going to release the API for Java platform very soon. – Usman Aziz May 27 '19 at 08:37
  • 1
    @vy32 GroupDocs.Redaction has been released for Java platform. See: https://products.groupdocs.com/redaction/java – Usman Aziz Jul 15 '19 at 07:47
2

Here's a web page that goes through what you need to do. As others mentioned you have to do this in Javascript as that's what Acrobat's native scripting is.

http://acrobatusers.com/tutorials/2008/07/auto_redaction_with_javascript

While I use Acrobat regularly, I've surprisingly never had a need to script it. I checked its AppleScript dictionary, and it looks like you'll have to write a JavaScript file, save it, and then run it from AppleScript if that's what you want to do (say, as a service).

tell application "Adobe Acrobat Professional"
   do script "this.info.title;"
end tell

Here's Adobe's Javascript for Acrobat documentation

http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=JavaScript_SectionPage.70.1.html

Clark
1

Within Adobe Acrobat you may be able to do this through the use of an ActionScript that can be invoked on a number of different events.

If you would like to do this in a separate application, there are a number of tools on a variety of platforms that can create and manipulate PDF documents, although I have yet to find a feature-rich open source library that comes close to some of these offerings.

http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/default.aspx

http://www.aspose.com/categories/java-components/aspose.pdf-for-java/default.aspx

http://itextpdf.com/

iText is my personal favorite and worth every penny.

maple_shaft
  • "ActionScript" is the scripting language for Flash, a dialect of ECMAScript. PDF uses JavaScript, which is another dialect of ECMAScript. The two are very similar, but Are Different. – Mark Storer May 20 '11 at 16:03
  • 1
    Acrobat is GUI-commanded; is there a way to run Acrobat programmatically? – vy32 May 23 '11 at 02:57
-2

Redacting PDFs in general is a pretty complex task.

You can redact PDF pages for free on doXiview (https://doxiview.cib.de). The redact option is located on the right side.

Another approach is to do it programmatically with CIB pdf toolbox (https://pdftoolbox.cib.de/).

PatrickF