I am trying to replace an image stream within an SDF document, using PDFNet 7.0.4 and netcoreapp3.1. As much as possible, I want to maintain the original object and its metadata; same dimensions, color system, compression, etc. Ideally object number and even generation would remain the same as well - the goal is that a before and after comparison would show only the changed pixels within the stream.

I'm getting the raw pixel data as a Stream object with this method:

private Stream GetImageData(int objectNum)
{
    var image = new PDF.Image(sdfDoc.GetObj(objectNum));

    var bits = image.GetBitsPerComponent();
    var channels = image.GetComponentNum();
    var bytesPerChannel = bits / 8;
    var height = image.GetImageHeight();
    var width = image.GetImageWidth();

    var data = image.GetImageData();
    var len = height * width * channels * bytesPerChannel;

    using (var reader = new pdftron.Filters.FilterReader(data))
    {
        var buffer = new byte[len];
        reader.Read(buffer);                

        return new MemoryStream(buffer);
    }
}

After manipulating the image data, I want to update it before saving the underlying SDFDoc object. I've tried using the following method:

private void SetImageData(int objectNum, Stream stream)
{
    var image = new PDF.Image(sdfDoc.GetObj(objectNum));

    var bits = image.GetBitsPerComponent();
    var channels = image.GetComponentNum();
    var bytesPerChannel = bits / 8;
    var height = image.GetImageHeight();
    var width = image.GetImageWidth();

    var len = height * width * channels * bytesPerChannel;
    if (stream.Length != len) { throw new DataMisalignedException("Stream length does not match expected image dimensions"); }

    using (var ms = new MemoryStream())
    using (var writer = new pdftron.Filters.FilterWriter(image.GetImageData()))
    {
        stream.CopyTo(ms);
        writer.WriteBuffer(ms.ToArray());
    }
}

This runs without error, but nothing actually appears to get updated. I've tried playing around with SDFObj.SetStreamData(), but haven't been able to make that work either. What is the lowest impact, highest performance way to directly replace just the raw pixel data within an image stream?


edit

I have this halfway working with this method:

private void SetImageData(int objectNum, Stream stream)
{
    var sdfObj = sdfDoc.GetObj(objectNum);
    var image = new PDF.Image(sdfObj);

    var bits = image.GetBitsPerComponent();
    var channels = image.GetComponentNum();
    var bytesPerChannel = bits / 8;
    var height = image.GetImageHeight();
    var width = image.GetImageWidth();

    var len = height * width * channels * bytesPerChannel;
    if (stream.Length != len) { throw new DataMisalignedException("Stream length does not match expected image dimensions"); }

    var buffer = new byte[len];
    stream.Read(buffer, 0, len);
    sdfObj.SetStreamData(buffer);
    sdfObj.Erase("Filters");
}

This works as expected, but with the obvious caveat that it just ignores any existing compression and turns the image into a raw uncompressed stream.

I've tried sdfObj.SetStreamData(buffer, image.GetImageData()); and sdfObj.SetStreamData(buffer, image.GetImageData().GetAttachedFilter()); and this does update the object in the file, but the resulting image fails to render.

superstator
  • Are you looking for a general solution? Because some PDF image formats are 1bit per-channel (8 pixels per byte). Perhaps you could elaborate on why this is important, there may be an easier way to accomplish this task. – Ryan Feb 19 '20 at 00:34
  • Practically speaking we’ll only be updating 8bit or 16bit per channel images. The goal is just to keep it fast, and make the absolute minimum of structural changes to the document – superstator Feb 19 '20 at 00:50
  • You are on the right track with your latest code. You want to use "stream.SetStreamData(byteArray, new FlateEncode(null));". Note that the 2nd parameter will Flate (zlib) compress your image data. If that doesn't help, can you provide a specific input PDF file for review (and the image you want to edit)? – Ryan Feb 21 '20 at 23:07
  • I'm looking at https://arxiv.org/pdf/2002.04610.pdf again, object 381. When I try passing the FlateEncode object to SetStreamData(), it still corrupts the image somehow. We also want to be able to handle other filter types like DCT, JPX, or LZW – superstator Feb 21 '20 at 23:37
  • If I just do the compression myself using `System.IO.Compression.DeflateStream` that works OK, but still leaves the issue of DCT/JPX etc – superstator Feb 22 '20 at 00:13
  • I think you are essentially there, you just need to delete the "DecodeParms" entry which would no longer be correct for the updated compressed bytes. – Ryan Feb 22 '20 at 01:32
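
On the `System.IO.Compression.DeflateStream` approach mentioned in the comments: `DeflateStream` writes raw deflate data (RFC 1951), while Flate-compressed PDF streams are normally zlib-wrapped (RFC 1950), meaning a 2-byte header before the deflate data and an Adler-32 checksum after it. Some readers tolerate the missing wrapper, but a stricter one may not. Below is a minimal sketch of adding the wrapper by hand, assuming netcoreapp3.1 (which has no built-in zlib stream class). The names `ZlibSketch`, `Adler32`, and `Compress` are hypothetical and not part of PDFNet.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class ZlibSketch
{
    // Adler-32 checksum of the uncompressed bytes, as defined in RFC 1950.
    public static uint Adler32(byte[] data)
    {
        const uint Mod = 65521;
        uint a = 1, b = 0;
        foreach (byte d in data)
        {
            a = (a + d) % Mod;
            b = (b + a) % Mod;
        }
        return (b << 16) | a;
    }

    // Wraps raw deflate output in a zlib header and Adler-32 trailer.
    public static byte[] Compress(byte[] raw)
    {
        using (var output = new MemoryStream())
        {
            // zlib header: deflate, 32K window, default compression level.
            output.WriteByte(0x78);
            output.WriteByte(0x9C);

            // Dispose the DeflateStream before writing the trailer so that
            // all compressed bytes are flushed; keep the MemoryStream open.
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                deflate.Write(raw, 0, raw.Length);
            }

            // Trailer: Adler-32 of the uncompressed data, big-endian.
            uint adler = Adler32(raw);
            output.WriteByte((byte)(adler >> 24));
            output.WriteByte((byte)(adler >> 16));
            output.WriteByte((byte)(adler >> 8));
            output.WriteByte((byte)adler);

            return output.ToArray();
        }
    }
}
```

The result of `ZlibSketch.Compress(buffer)` could then be passed to the single-argument `SetStreamData` call shown in the question, with the stream's existing Flate filter entry left in place. As the last comment notes, a stale "DecodeParms" entry would still need to be removed.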

1 Answer

The following code shows how to retain an Image object, but change the actual stream data.

static private Stream GetImageData(Obj o)
{
    var image = new pdftron.PDF.Image(o);

    var bits = image.GetBitsPerComponent();
    var channels = image.GetComponentNum();
    var bytesPerChannel = bits / 8;
    var height = image.GetImageHeight();
    var width = image.GetImageWidth();

    var data = image.GetImageData();
    var len = height * width * channels * bytesPerChannel;

    using (var reader = new pdftron.Filters.FilterReader(data))
    {
        var buffer = new byte[len];
        reader.Read(buffer);
        return new MemoryStream(buffer);
    }
}

static private void SetImageData(PDFDoc doc, Obj o, Stream stream)
{

    var image = new pdftron.PDF.Image(o);

    var bits = image.GetBitsPerComponent();
    var channels = image.GetComponentNum();
    var bytesPerChannel = bits / 8;
    var height = image.GetImageHeight();
    var width = image.GetImageWidth();

    var len = height * width * channels * bytesPerChannel;
    if (stream.Length != len) { throw new DataMisalignedException("Stream length does not match expected image dimensions"); }

    o.Erase("DecodeParms"); // Important: the old entry won't be accurate after SetStreamData
    // now we actually do the stream swap
    o.SetStreamData((stream as MemoryStream).ToArray(), new FlateEncode(null));
}

static private void InvertPixels(Stream stream)
{
    // This function is for DEMO purposes
    // this code assumes 3 channel 8bit
    long length = stream.Length;
    long pixels = length / 3;
    for(int p = 0; p < pixels; ++p)
    {
        int c1 = stream.ReadByte();
        int c2 = stream.ReadByte();
        int c3 = stream.ReadByte();
        stream.Seek(-3, SeekOrigin.Current);
        stream.WriteByte((byte)(255 - c1));
        stream.WriteByte((byte)(255 - c2));
        stream.WriteByte((byte)(255 - c3));
    }
    stream.Seek(0, SeekOrigin.Begin);
}

And here is sample code showing how to use these methods.

static void Main(string[] args)
{
    PDFNet.Initialize();

    var x = new PDFDoc(@"2002.04610.pdf");
    x.InitSecurityHandler();

    var o = x.GetSDFDoc().GetObj(381);
    Stream source = GetImageData(o);
    InvertPixels(source);
    SetImageData(x, o, source);
    x.Save(@"2002.04610-MOD.pdf", SDFDoc.SaveOptions.e_remove_unused);
}
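
The length calculation in the helpers above uses `bits / 8`, which evaluates to 0 for the 1-bit images mentioned in the comments, and it relies on a single `Read` call filling the whole buffer. Below is a minimal sketch of two more defensive helpers. It assumes the decoded sample data follows the PDF convention that each row starts on a byte boundary; I have not verified this against every image type PDFNet can return. The names `ImageBufferSketch`, `ExpectedByteLength`, and `ReadFully` are hypothetical and not part of PDFNet.

```csharp
using System;
using System.IO;

static class ImageBufferSketch
{
    // Number of bytes of decoded sample data, with each row padded
    // up to a whole number of bytes.
    public static int ExpectedByteLength(int width, int height, int channels, int bitsPerComponent)
    {
        long bitsPerRow = (long)width * channels * bitsPerComponent;
        long bytesPerRow = (bitsPerRow + 7) / 8;   // round up to a full byte
        return checked((int)(bytesPerRow * height));
    }

    // Reads exactly 'length' bytes, looping because Stream.Read may
    // return fewer bytes than requested.
    public static byte[] ReadFully(Stream source, int length)
    {
        var buffer = new byte[length];
        int offset = 0;
        while (offset < length)
        {
            int read = source.Read(buffer, offset, length - offset);
            if (read == 0)
            {
                throw new EndOfStreamException("Stream ended before the expected number of bytes was read");
            }
            offset += read;
        }
        return buffer;
    }
}
```

For the 8-bit and 16-bit images discussed here, `ExpectedByteLength` gives the same result as the original `height * width * channels * bytesPerChannel`; it only differs for bit depths that are not a multiple of 8.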
Ryan
  • Thanks, DecodeParms was indeed the last missing piece. I take it there is no built-in way to handle the DCT side of things without using GDI or System.Drawing? – superstator Feb 22 '20 at 03:25
  • Object 381 in the PDF you sent was also compressed as Flate (zlib). If you want to handle image data that uses other compression methods (and note, it's possible to chain filters in a PDF), you can, but only by creating a brand new image object and setting the encoding. That would then require swapping out the old object (381) and having the XObject resource key point to the new image object, which it seemed you explicitly did not want to do. – Ryan Feb 24 '20 at 17:32
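
Because filters can be chained, a stream's Filter entry is either a single name or an array of names. Below is a minimal sketch of the decision implied by the last comment: whether an image can be updated in place or needs the new-object route. It assumes the filter names have already been read from the stream dictionary into a list (how to do that with PDFNet is not shown here). The names `FilterCheckSketch` and `CanSwapInPlace` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class FilterCheckSketch
{
    // Returns true when the in-place stream swap from the answer applies:
    // the stream is either unfiltered or uses exactly one Flate filter.
    public static bool CanSwapInPlace(IReadOnlyList<string> filterNames)
    {
        if (filterNames == null || filterNames.Count == 0)
        {
            // No filter: the raw bytes can be stored as-is,
            // as in the question's edit.
            return true;
        }

        // A chain of filters, or a single non-Flate filter such as
        // DCTDecode, JPXDecode, or LZWDecode, needs the new-object route.
        return filterNames.Count == 1 && filterNames[0] == "FlateDecode";
    }
}
```

For example, `CanSwapInPlace(new[] { "FlateDecode" })` returns true, while a DCT-encoded image or a chained pair of filters returns false.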