I've seen the following question on SO: Create Multi-Page PDF from other PDFs

But it didn't answer what I need. Consider that I have a PDF with 20 pages. So far so good.

From the same source, I can also get a PDF with only one page. This one will be used as my template PDF. What I'm trying to do is replace the content stream (the FlateDecode stream, and its length as well) in the template PDF and generate a new single-page PDF.

I got the PDF to work; however, a small logo doesn't display, and Adobe Reader says there is a problem displaying the PDF correctly (Google Chrome and Edge just don't display the logo, with no error message).

I've tried adjusting the xref table at the end of the file (manually changing the values) but got the same results.

Does anyone with some knowledge of PDF have any input?

I'm uploading the template PDF and the other PDF that I want to extract data from in order to create a third PDF (using the template PDF but with the contents of the other PDF). I'm also uploading a PDF I made manually that has a display error (it displays the data but without the JPEG logo).

Everything is here: https://drive.google.com/drive/folders/1tsGIbtbfwuATPQ6a_VPjnxLT4ozzNt0s?usp=sharing

I've been doing everything using HxD (to view the hexadecimal content and copy/paste data).

Thanks in advance

EDIT: I'm adding the code I'm currently using to generate a PDF. It's an invalid PDF even with the xref table okay (with the proper positions). The code is extremely ugly, but for now I'm focused on making it work (rather than on making it nice).

    static void Main(string[] args)
    {
        Console.WriteLine("Hello World!");

        var jpegLogo = File.ReadAllBytes(@"C:\test\Ginfes-Reboot\jpegLogo.raw");
        var pdfStream = File.ReadAllBytes(@"C:\test\Ginfes-Reboot\pdfStream.raw");
        using (BinaryWriter b = new BinaryWriter(
            File.Open(@"C:\test\Ginfes-Reboot\newPdf_newmethod.pdf", FileMode.Create)))
        {
            WritePDFAgain(b, jpegLogo, pdfStream);
        }
    }
    private static void WritePDFAgain(BinaryWriter b, byte[] jpegLogo,byte[] pdfStream)
    {
        List<long> offSets = new List<long>();
        string str = "%PDF-1.4" + "\n";
        var byteArr = Encoding.ASCII.GetBytes(str);
        b.Write(byteArr);
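        // Binary comment line (% followed by four bytes >= 0x80 and a newline), written as literal bytes
        byteArr = new byte[] { 0x25, 0xE2, 0xE3, 0xCF, 0xD3, 0x0A };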
        b.Write(byteArr);
        offSets.Add(b.BaseStream.Position);//0
        // Take /Length from the actual JPEG data instead of hard-coding it
        str = "3 0 obj" + "\n" + "<</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width 60/Length " + jpegLogo.Length + "/Height 60/Filter/DCTDecode>>stream" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(jpegLogo);
        b.Write(Encoding.ASCII.GetBytes("\n"));
        b.Write(Encoding.ASCII.GetBytes("endstream" +"\n" + "endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//1
        str = "4 0 obj" + "\n" + "<</Length " + pdfStream.Length + "/Filter/FlateDecode>>stream" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(pdfStream);
        b.Write(Encoding.ASCII.GetBytes("\n"));
        b.Write(Encoding.ASCII.GetBytes("endstream" + "\n" + "endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//2
        str = "1 0 obj" + "\n" + "<</Group<</Type/Group/CS/DeviceRGB/S/Transparency>>/Parent 5 0 R/Contents 4 0 R/Type/Page/Resources<</XObject<</img0 3 0 R>>/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]/ColorSpace<</CS/DeviceRGB>>/Font<</F1 2 0 R>>>>/MediaBox[0 0 595 936]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//3
        str = "6 0 obj" + "\n" + "[1 0 R/XYZ 0 814 0]" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//4
        str = "2 0 obj" + "\n" + "<</BaseFont/Helvetica/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//5
        str = "5 0 obj" + "\n" + "<</ITXT(2.1.7)/Type/Pages/Count 1/Kids[1 0 R]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//6
        str = "7 0 obj" + "\n" + "<</Names[(JR_PAGE_ANCHOR_0_1) 6 0 R]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//7
        str = "8 0 obj" + "\n" + "<</Dests 7 0 R>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//8
        str = "9 0 obj" + "\n" + "<</Names 8 0 R/Type/Catalog/ViewerPreferences<</PrintScaling/AppDefault>>/Pages 5 0 R>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//9
        str = "10 0 obj" + "\n" + @"<</Creator(JasperReports \(nfs_novo\))/Producer(iText 2.1.7 by 1T3XT)/ModDate(D:20191211152903-03'00')/CreationDate(D:20191211152903-03'00')>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        long xrefPosition = b.BaseStream.Position; // startxref must point at the "xref" keyword
        b.Write(Encoding.ASCII.GetBytes("xref" + "\n" + "0 11" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("0000000000 65535 f " + "\n"));
        // Entries are listed by object number (1..10); each maps to the index of its recorded offset.
        // In-use objects are marked "n", and offsets are zero-padded to exactly 10 digits.
        foreach (int index in new[] { 2, 4, 0, 1, 5, 3, 6, 7, 8, 9 })
        {
            b.Write(Encoding.ASCII.GetBytes(offSets[index].ToString("D10") + " 00000 n " + "\n"));
        }
        b.Write(Encoding.ASCII.GetBytes("trailer" + "\n" + "<</Root 9 0 R/ID [<10a2f7fd162aa44a268ebb6f31cc98c4><c36ebb9dc93cd9a72f229f618092eeb0>]/Info 10 0 R/Size 11>>" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("startxref" + "\n" + xrefPosition + "\n" + "%%EOF" + "\n"));
    }

Files used: https://drive.google.com/drive/folders/1i3J-yioFvcoiakyc_Wi8ddn9g6Pxy7zd?usp=sharing

paboobhzx
  • In addition to updating the offsets in the xref table, you also need to change the `startxref` marker at the end of the file (assuming the overall file size has changed). Are you doing this too? – Bradley Smith Dec 17 '19 at 04:21
  • Yeah, I'm doing that. To avoid problems with the xref table, I changed the order of the objects so that the last thing written is the FlateDecode stream (you can see this in my template). Even so, I can't get it to work. – paboobhzx Dec 17 '19 at 04:23
  • Well, looking at the resulting PDF, I can tell you what's wrong with it. The content stream for the page references an XObject called img12 which is not present in the document. You need to include the object as well as an entry for it in the page's Resources dictionary. (This looks to be the same as img0 in the final document, so it might be enough to just change its key in the dictionary to img12) – Bradley Smith Dec 17 '19 at 04:30
  • It worked. Let me try with other samples and I'll tell you if this does the job =) ! – paboobhzx Dec 17 '19 at 04:34
  • Just be aware that this is only going to work in a very specific set of circumstances. If you want to be able to replace page content in PDFs more generally, there are a lot more steps to consider... but in this specific case, it looks like you only need to worry about making sure the objects referenced in the page contents properly match the key names in the Resources dictionary. – Bradley Smith Dec 17 '19 at 04:41
  • Can you help me with this template? https://drive.google.com/open?id=106viNTGUsIkneCZuXn6YJf5Di2-chWx2 - this is the one I edited, adding the FlateDecode stream as the last item (so I wouldn't need to change the xref table). But it still doesn't display the logo. Maybe because of the startxref marker? – paboobhzx Dec 17 '19 at 04:45
  • I can see why. The xref table got corrupted. I've been comparing all the values and they don't match the file. I believe the best way to do this is to write each object manually, save its offset, and write the xref table manually at the end. Thanks for your support. – paboobhzx Dec 17 '19 at 04:50
  • I think this demonstrates why developers tend to use external libraries for this sort of thing - if you are determined not to use an external library, then you will end up effectively writing your own in order to solve the problem. – Bradley Smith Dec 17 '19 at 04:53
  • An external library adds some processing and memory overhead to our application; this is why I'm trying to figure out a way of doing it without one. But in the end I'll have to write something of my own, like you mentioned. Thanks! – paboobhzx Dec 17 '19 at 04:54
  • @BradleySmith please take a look at my edit. I believe I'm on the way to getting it working =) ! Also, please post an answer so I can mark it as accepted. – paboobhzx Dec 17 '19 at 14:41

1 Answer

You are most of the way there; the only problem with the resulting PDF from your example is that the image resource referenced in pdfStream is named img10, whereas the name you are assigning when you create the resource dictionary is img0.

Below is some code that will identify the correct referenced resource (using a regular expression on the page content), which you can then use when building the dictionary.

You need these additional using directives:

using System.IO.Compression;
using System.Text.RegularExpressions;

This method decompresses the page content stream and matches the image resource name:

private static string GetImageResourceName(byte[] pdfStream) {
    using (MemoryStream ms = new MemoryStream(pdfStream)) {                
        ms.Seek(2, SeekOrigin.Begin);   // skip first 2 bytes (zlib header)

        using (DeflateStream ds = new DeflateStream(ms, CompressionMode.Decompress)) {
            using (StreamReader sr = new StreamReader(ds)) {
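                string contents = sr.ReadToEnd();

                // The content stream operator that paints an image XObject looks like: /img123 Do
                Match match = Regex.Match(contents, @"\b(img\d+)\s+Do\b");
                if (!match.Success)
                    throw new InvalidDataException("No image XObject reference found in the page content.");
                return match.Groups[1].Value;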
            }
        }
    }
}
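
The two skipped bytes are the zlib header (RFC 1950) that wraps the raw deflate data `DeflateStream` expects. If you want to fail early on input that is not zlib-wrapped, a minimal sketch could validate that header first. `ZlibHelper` and `Inflate` are hypothetical names, and the sketch assumes the stream uses no preset dictionary:

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ZlibHelper
{
    // Hypothetical helper: checks the 2-byte zlib header (RFC 1950) and then
    // inflates the raw deflate data that follows it.
    public static byte[] Inflate(byte[] zlibData)
    {
        if (zlibData == null || zlibData.Length < 2)
            throw new InvalidDataException("Too short to be a zlib stream.");

        int cmf = zlibData[0];
        int flg = zlibData[1];
        bool isDeflate = (cmf & 0x0F) == 8;              // compression method 8 = deflate
        bool headerCheckOk = ((cmf << 8) | flg) % 31 == 0; // header check value from RFC 1950
        bool hasPresetDictionary = (flg & 0x20) != 0;    // FDICT flag: not handled in this sketch
        if (!isDeflate || !headerCheckOk || hasPresetDictionary)
            throw new InvalidDataException("Unsupported or missing zlib header.");

        // Skip the 2 header bytes; the trailing Adler-32 checksum is not verified here.
        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

The decoded bytes can then be turned into text with `Encoding.ASCII.GetString(...)` before applying the regular expression.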

Finally, you only need to change this line in your WritePDFAgain method:

str = String.Format(
    "1 0 obj\n<</Group<</Type/Group/CS/DeviceRGB/S/Transparency>>" 
    + "/Parent 5 0 R/Contents 4 0 R/Type/Page/Resources<</XObject" 
    + "<</{0} 3 0 R>>/ProcSet [/PDF /Text /ImageB /ImageC " 
    + "/ImageI]/ColorSpace<</CS/DeviceRGB>>/Font<</F1 2 0 R>>>>" 
    + "/MediaBox[0 0 595 936]>>\n", 
    GetImageResourceName(pdfStream)
);

As per my disclaimer in the comments, this code will only work for this very specific case and input data. It is by no means a general purpose solution, but I think you accept that.
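
One of the assumptions baked into that regular expression is that the page uses a single image whose name has the form `img` followed by digits. A slightly more general sketch collects every name used as the operand of a `Do` operator in the decoded content. `ContentStreamScanner` and `FindXObjectNames` are hypothetical names, and the pattern is still only a heuristic: it does not tokenize the content stream, so it could be fooled by text inside string literals or inline image data.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamScanner
{
    // Hypothetical helper: returns each distinct name that appears directly
    // before a Do operator, in order of first appearance.
    public static List<string> FindXObjectNames(string decodedContent)
    {
        var names = new List<string>();

        // A PDF name starts with '/' and runs until whitespace or a delimiter character.
        const string pattern = @"/([^\s/\[\]()<>{}%]+)\s+Do(?=\s|$)";

        foreach (Match match in Regex.Matches(decodedContent, pattern))
        {
            string name = match.Groups[1].Value;
            if (!names.Contains(name))
                names.Add(name);
        }
        return names;
    }
}
```

Each returned name would then need its own entry in the page's `/XObject` resource dictionary, pointing at the matching image object.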

I will reiterate my point that if you are intent on not using any external libraries for this, then you will likely end up writing your own (albeit a very basic one).
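
To give an idea of what "a very basic one" involves, here is a minimal sketch of a writer that records each object's byte offset as it is written and generates the xref table, trailer and `startxref` from those records. `MiniPdfWriter` is a hypothetical class, not part of your code. It assumes a seekable output stream, objects numbered 1..N with generation 0, and that the caller puts a correct `/Length` in any stream dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: tracks object offsets so the xref table is generated, not hand-typed.
public sealed class MiniPdfWriter
{
    private readonly Stream _output;
    private readonly SortedDictionary<int, long> _offsets = new SortedDictionary<int, long>();

    public MiniPdfWriter(Stream output)
    {
        _output = output;
        WriteAscii("%PDF-1.4\n");
    }

    // Writes "N 0 obj ... endobj"; streamData is optional raw (already encoded) stream bytes.
    public void WriteObject(int number, string dictionary, byte[] streamData = null)
    {
        _offsets[number] = _output.Position; // byte offset of "N 0 obj"
        WriteAscii(number + " 0 obj\n" + dictionary);
        if (streamData != null)
        {
            WriteAscii("stream\n");
            _output.Write(streamData, 0, streamData.Length);
            WriteAscii("\nendstream");
        }
        WriteAscii("\nendobj\n");
    }

    // Writes the xref table, a minimal trailer and the startxref pointer.
    public void Finish(int rootObject)
    {
        long xrefPosition = _output.Position; // startxref must point at the "xref" keyword
        int size = _offsets.Count + 1;        // +1 for the free entry of object 0
        WriteAscii("xref\n0 " + size + "\n");
        WriteAscii("0000000000 65535 f \n");  // every entry is exactly 20 bytes long

        int expected = 1;
        foreach (KeyValuePair<int, long> entry in _offsets)
        {
            if (entry.Key != expected++)
                throw new InvalidOperationException("Object numbers must be contiguous from 1.");
            WriteAscii(entry.Value.ToString("D10") + " 00000 n \n"); // "n" = in use
        }

        WriteAscii("trailer\n<</Root " + rootObject + " 0 R/Size " + size + ">>\n");
        WriteAscii("startxref\n" + xrefPosition + "\n%%EOF\n");
    }

    private void WriteAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        _output.Write(bytes, 0, bytes.Length);
    }
}
```

With something like this, the objects can be written in any order (for example, the image and content streams first, as in your code) and the table still ends up sorted by object number with the right offsets.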

Bradley Smith