I fetch an SGML file, extract data from it with a uuDecoder, and create a PDF from the result.

This has worked fine for many years, but for the last few months some of the PDF files fail to load, and Chrome reports "Failed to load PDF document".

I have gone through the question below, which is similar to my case, but it is in Python and my code is in C#:

How can we figure out why certain uuencoded files are not decoding properly using Python?

Here is an example of a txt file with an embedded uuencoded PDF that has the issue: https://www.sec.gov/Archives/edgar/data/1631661/000163166116000004/0001631661-16-000004.txt

My uuDecoder code is almost identical to this: http://blog.stevex.net/2004/04/c-classes-to-decode-yenc-and-uuencode-encoded-usenet-binaries/

I found that it throws an index-out-of-range exception in the code below, which expects 61 characters per line, but some of the lines do not have exactly 61 characters:

        public static byte[] uuDecode(string buffer)
        { 
            // Create an output array
            byte[] outBuffer = new byte[(buffer.Length-1)/4*3];
            int outIdx = 0;

            // Get the string as an array of ASCII bytes
            byte[] asciiBytes = Encoding.ASCII.GetBytes(buffer);

            for (int i=0; i<asciiBytes.Length; i++)
            {
                asciiBytes[i] = (byte)((asciiBytes[i]-0x20) & 0x3f);
            }

            // Convert each block of 4 input bytes into 3 
            // output bytes
            for (int i = 1; i <= (asciiBytes.Length-1); i += 4) 
            { 
                outBuffer[outIdx++] = (byte)(asciiBytes[i] << 2 | asciiBytes[i+1] >> 4);
                outBuffer[outIdx++] = (byte)(asciiBytes[i+1] << 4 | asciiBytes[i+2] >> 2);
                outBuffer[outIdx++] = (byte)(asciiBytes[i+2] << 6 | asciiBytes[i+3]);
            } 

            return outBuffer;
        } 

Please note that this is not a general question about the "Index out of range" exception, so please do not redirect it to one.

I tried to pad the missing characters with spaces, as below:

                if (line.Length < 61) // Make sure the line is 61 characters long
                {
                    var builder = new StringBuilder();
                    builder.Append(line);
                    var missing = 61 - line.Length;

                    for (int i = 0; i < missing; i++)
                    {
                        builder.Append(" ");
                    }

                    line = builder.ToString();

                }

Can someone please help me understand why this is not working for a few PDF documents?

Neel

1 Answer

I think the problem is with the spaces. Improved uuencode implementations use the backtick character (code 0x60) instead of a space (0x20), because trailing spaces can be trimmed. Try padding every line on the right to its full size. Try this conversion (pdf.uue is the uuencoded part of the file, from the begin line to the end line):

        string[] all = File.ReadAllLines(@"d:\tmp\pdf.uue");
        for (int i = 1; i < all.Length - 2; i++)
        {
            if (all[i].Length < 61)
                all[i] = all[i].PadRight(61, ' ');
        }
        File.WriteAllLines(@"d:\tmp\pdf-2.uue", all);

The loop starts at index 1 and stops before Length-2 to skip the begin/end lines.
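
Instead of rewriting the file, the padding can also be done while decoding. In uuencode, the first character of each data line gives the number of decoded bytes on that line (`M` means 45), so the decoder can work out how many encoded characters it needs and treat any that were trimmed as spaces. Here is a minimal sketch of that idea. The names `UuSketch`, `DecodeLine` and `DecodeBody` are made up for illustration and are not from the linked blog post, and I have not run this against the SEC file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class UuSketch
{
    // Decodes one uuencoded data line. The first character carries the
    // number of payload bytes; characters missing at the end of the line
    // (trimmed trailing spaces) are treated as spaces, i.e. the value 0.
    public static byte[] DecodeLine(string line)
    {
        if (string.IsNullOrEmpty(line))
            return new byte[0];

        // Both ' ' (0x20) and '`' (0x60) become 0 after the mask.
        int count = (line[0] - 0x20) & 0x3f;
        if (count == 0)
            return new byte[0];

        // Every 3 output bytes need 4 encoded characters.
        int needed = (count + 2) / 3 * 4;
        string body = line.Substring(1).PadRight(needed, ' ');

        var result = new byte[count];
        int outIdx = 0;
        for (int i = 0; i < needed && outIdx < count; i += 4)
        {
            int c0 = (body[i] - 0x20) & 0x3f;
            int c1 = (body[i + 1] - 0x20) & 0x3f;
            int c2 = (body[i + 2] - 0x20) & 0x3f;
            int c3 = (body[i + 3] - 0x20) & 0x3f;

            result[outIdx++] = (byte)(c0 << 2 | c1 >> 4);
            if (outIdx < count) result[outIdx++] = (byte)(c1 << 4 | c2 >> 2);
            if (outIdx < count) result[outIdx++] = (byte)(c2 << 6 | c3);
        }
        return result;
    }

    // Decodes everything between the "begin ..." line and the "end" line.
    // Assumption: the input holds a single uuencoded block.
    public static byte[] DecodeBody(IEnumerable<string> lines)
    {
        var output = new MemoryStream();
        bool inBody = false;
        foreach (string line in lines)
        {
            if (!inBody)
            {
                inBody = line.StartsWith("begin ", StringComparison.Ordinal);
                continue;
            }
            if (line == "end")
                break;
            byte[] chunk = DecodeLine(line);
            output.Write(chunk, 0, chunk.Length);
        }
        return output.ToArray();
    }
}
```

For example, `UuSketch.DecodeBody(File.ReadAllLines(@"d:\tmp\pdf.uue"))` should give the PDF bytes without creating pdf-2.uue first. Sizing the output from the count character also drops the one or two zero bytes that uuencode appends when the data length is not divisible by three.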

i486
  • I don't follow. Can you please explain more? – Neel Jun 01 '16 at 14:16
  • @Neel Test my example. I used WinRar to uudecode the original uuencoded part and the resulting PDFs were wrong. After the fix, they seem OK. – i486 Jun 01 '16 at 14:25
  • I guess you filled the remaining missing characters with spaces, correct? – Neel Jun 01 '16 at 14:30
  • Yes, it is filled with spaces, each of which is uudecoded as 0. The problem is that space is an "unsafe" character: it can be removed from the end of a line to optimize the text file. But in this case it is real data. – i486 Jun 01 '16 at 14:34
  • I have already tried that, mate. Some of them started loading, but there are still many documents that do not open :( – Neel Jun 01 '16 at 14:36
  • In the above example, all 4 PDFs opened. Post another link that is not working... – i486 Jun 01 '16 at 14:39
  • The PDF opens on the SEC site, but when I try to fetch a different encoded file from https://www.sec.gov/Archives/edgar/data/1631661/000163166116000004/0001631661-16-000004.txt it fails, and the only reason is that the spaces are missing. Can you please suggest something else? – Neel Jun 01 '16 at 14:43
  • To calculate the size of `outBuffer`, use the first byte minus 0x20, not the formula with `/4 * 3`. E.g. the letter `M` is 0x4D - 0x20 = 0x2D = 45 bytes. – i486 Jun 01 '16 at 14:43
  • You mean like this? `byte[] outBuffer = new byte[(buffer.Length - 1) / 0x20];` – Neel Jun 01 '16 at 14:48
  • `byte[] outBuffer = new byte[(buffer.Length - 1) - 0x20];` but that subtraction is already done in the lines below in that code – Neel Jun 01 '16 at 15:04
  • `byte[] outBuffer = new byte[ (int)buffer[0] - 0x20 ];` – i486 Jun 01 '16 at 20:21
  • No buddy, it's not working :( Is it possible that the file contents themselves are corrupted? I mean many of those lines are not in the 61-character format, and even when I fill the missing characters with spaces it still does not work. Should I try another algorithm? Please help – Neel Jun 02 '16 at 06:09
  • @Neel "many of these are not in 61 char format" - if the first char is M then the line must be 61 chars, padded with spaces on the right. If the first letter is not M then the line can be shorter. Try to uudecode with WinRar or another tool to test whether the PDFs are wrong. – i486 Jun 02 '16 at 07:21
  • Thanks. I tried that particular data with an online uudecoder and it worked perfectly there, but it is not loading in my code. Can you please tell me what might be wrong? I tried here http://www.webutils.pl/index.php?idx=uu Is there any way I can get the uudecoder code used by this site? – Neel Jun 02 '16 at 09:02
  • This line is mentioned on that website: "If the count of data bytes is not divisible by three, one or two additional bytes of zero are appended. These are not included in the count at the beginning of the last line." Is this not what we have already done? – Neel Jun 02 '16 at 09:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/113698/discussion-between-neel-and-i486). – Neel Jun 03 '16 at 05:38