Find and update in binary file

Question

I have PDF files that are the result of merging multiple TIFF files. Every page is therefore an image XObject. Each page looks like this when opened in PDF Walker:

7 0 obj
<<
    /Type /XObject
    /Subtype /Image
    /Width 1653
    /Height 2339
    /BitsPerComponent 4
    /ColorSpace [ /Indexed /DeviceRGB 15 8 0 R ]
    /DecodeParms [ <<
    /Columns 1653
>> ]
    /Filter [ /FlateDecode ]
    /Length 219260
>>

I found that the related content stream is generated incorrectly: it is missing the last few lines of image data. Opening such a PDF in Acrobat Reader shows the error message "Insufficient data for an image". The error goes away if the declared height is lowered by a fixed constant, e.g. 10.

The text /Height 2339 should be changed to e.g. /Height 2330, which is sufficient to work around the issue.

If this were a text file, I would use regular expressions to find the page heights and update them as needed. But I am not sure how best to handle the update in a binary file.

Note: I am not asking how to read or write binary files. The PDF can be loaded into memory, e.g. as a byte array. The question is about how to handle the problem efficiently, without looping through the array, comparing every seven bytes against the sequence /Height, and then parsing the following bytes that represent the number of pixels.
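For illustration, regular expressions are not limited to text: in some languages they can run directly over a byte array. A minimal sketch in Python (chosen only for illustration, the question does not name a language), assuming the height is written as a literal integer after the `/Height` key as in the dump above. The names `HEIGHT_RE` and `find_heights` are hypothetical.

```python
import re

# Bytes pattern: the /Height key, PDF whitespace, then the integer value.
# Assumption: the value is a direct integer, not an indirect reference.
HEIGHT_RE = re.compile(rb"/Height[ \t\r\n\x00\x0c]+(\d+)")

def find_heights(pdf_bytes):
    """Return (start, end, value) for every /Height number in the buffer."""
    return [
        (m.start(1), m.end(1), int(m.group(1)))
        for m in HEIGHT_RE.finditer(pdf_bytes)
    ]

sample = b"<< /Type /XObject /Subtype /Image /Width 1653 /Height 2339 >>"
for start, end, value in find_heights(sample):
    print(start, end, value)
```

The regex engine still scans the buffer once, but the scanning loop is not written by hand. Compressed stream data could in principle contain the same byte sequence by coincidence, so the reported positions are candidates rather than guaranteed dictionary entries.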

Can you please describe the kind of efficiency you are seeking? Are you wondering whether the file can be searched without loading its entire contents into memory? Or do you already know the location and want to edit it without loading the entire file? - Vikas Gupta, Sep 17 '14 at 16:34
The file is already in memory as the result of a conversion. I would like to update it before it is stored in the db. I would like to know the best way to get the positions of the numbers next to /Height in a binary file. - mybrave, Sep 17 '14 at 16:45
"Best" is very vague. Personally, I would say "best" would be to fix the problem at the PDF generation side instead of manually tweaking bits later. If that's not an option, my second personal "best" recommendation would be to run the bytes through a PDF-aware library like iTextSharp, which will allow you to walk the PDF as an object and hopefully fix things. If you don't want to do that either, then you'll need to convert your PDF names into an array of ASCII bytes and [just search the master array](http://stackoverflow.com/a/283648/231316). - Chris Haas, Sep 17 '14 at 16:54
Fixing the generation part is a good recommendation, but it is third-party code. The reason for the update is to have a workaround until it is fixed there. I agree that "best" can mean a lot of things. What I wanted to say is that I am looking for a way to get the positions without looping through the whole file... - mybrave, Sep 17 '14 at 17:15
I checked the link you posted and there are good suggestions, thanks. - mybrave, Sep 17 '14 at 17:25
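The byte search suggested in the comments could look roughly like the following Python sketch. It assumes the key is written exactly as `/Height ` with a single space, followed by a literal integer. The function name `lower_heights` and the parameter `delta` are hypothetical. The new number is padded with trailing spaces to the old width so the file length does not change. A PDF cross-reference table stores byte offsets, so inserting or deleting bytes would invalidate it.

```python
def lower_heights(pdf_bytes, delta=10):
    """Return a copy with every literal '/Height <n>' lowered by delta.

    The replacement keeps the original byte width, so offsets recorded
    elsewhere in the file stay valid.
    """
    data = bytearray(pdf_bytes)
    key = b"/Height "  # assumption: one space between key and value
    pos = data.find(key)
    while pos != -1:
        start = pos + len(key)
        end = start
        # Advance over the ASCII digits that make up the value.
        while end < len(data) and 48 <= data[end] <= 57:
            end += 1
        if end > start:
            old = int(data[start:end].decode("ascii"))
            new = max(old - delta, 0)
            # Left-justify and pad with spaces to keep the same width.
            data[start:end] = str(new).encode("ascii").ljust(end - start)
        pos = data.find(key, end)
    return bytes(data)

sample = b"<< /Width 1653 /Height 2339 >>"
print(lower_heights(sample))
```

This is only a stopgap of the kind discussed above. A PDF-aware library remains the more robust route, since it parses the dictionaries instead of matching raw bytes.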

0 Answers