Reading a PDF, character problems

Question

I'm trying to use PurePDF to gather some information from inside a PDF file, but I can't get PurePDF to read it.

Whenever PurePDF tries to read any PDF, it says it can't find the header. I tried debugging it and noticed that the strings read from the ByteArray come out as Japanese characters! I have tried changing the endianness of my PDF's ByteArray before passing it to PurePDF, but that didn't change anything.

The PDF file is fine, as I can see the "%PDF-" header whenever I open it as text, but for some reason ActionScript is getting the wrong character codes, so PurePDF can't work at all.

Any ideas?

Thanks.

Update: I'm not a ByteArray specialist, but I decided to follow the code execution through the debugger and found that it was using readInt() to get the characters. I changed it to readByte() and now it reads the PDF! I have yet to see whether the features work. Can anyone who is more into low-level programming explain what might be happening? I don't think the project is broken in the SVN repository.

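This is the code I have been using; I think it is quite straightforward:

// Imports needed at the top of the class file
// (PDF_FILE and pdfReader are members defined elsewhere in the class).
import flash.events.Event;
import flash.net.URLLoader;
import flash.net.URLLoaderDataFormat;
import flash.net.URLRequest;
import flash.utils.ByteArray;

private function loadPdf():void
{
    // Load the file as raw bytes rather than as text.
    var loader:URLLoader = new URLLoader();
    loader.dataFormat = URLLoaderDataFormat.BINARY;
    loader.addEventListener(Event.COMPLETE, onLoadComplete);
    loader.load(new URLRequest(PDF_FILE));
}

protected function onLoadComplete(event:Event):void
{
    // Hand the loaded bytes to PurePDF.
    var data:ByteArray = URLLoader(event.target).data as ByteArray;
    pdfReader = new PdfReader(data);
    pdfReader.readPdf();
}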
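
One possible explanation for the readInt() observation in the update above, as a minimal sketch. It assumes (going only by the update, not by PurePDF's source) that each character was built from the value returned by readInt(), and that the resulting character code ends up reduced to 16 bits; the explicit mask below models that assumed truncation. The function name compareReads is hypothetical.

```actionscript
import flash.utils.ByteArray;
import flash.utils.Endian;

// Hypothetical demo, not PurePDF code: compares readInt() and
// readUnsignedByte() on the first bytes of a PDF header.
function compareReads():void
{
    var bytes:ByteArray = new ByteArray();
    bytes.writeUTFBytes("%PDF-1.4");      // bytes 0x25 0x50 0x44 0x46 ...
    bytes.endian = Endian.BIG_ENDIAN;     // the ByteArray default

    // readInt() consumes four bytes and returns them as one 32-bit value.
    bytes.position = 0;
    var packed:int = bytes.readInt();     // 0x25504446 for "%PDF"
    var lowBits:int = packed & 0xFFFF;    // 0x4446 (assumed 16-bit truncation)
    trace(String.fromCharCode(lowBits));  // a single CJK ideograph

    // readUnsignedByte() consumes one byte, which is one ASCII character here.
    bytes.position = 0;
    var header:String = "";
    for (var i:int = 0; i < 5; i++)
    {
        header += String.fromCharCode(bytes.readUnsignedByte());
    }
    trace(header);
}
```

U+4446 lies in the CJK Unified Ideographs Extension A block (U+3400 to U+4DBF), which would look like "Japanese characters" in a debugger. Switching to little-endian gives 0x46445025, whose low 16 bits (0x5025) fall in the main CJK Unified Ideographs block (U+4E00 to U+9FFF), which would be consistent with the endianness change making no visible difference.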

Not that I will know the answer if you do this, but I feel compelled to comment that you should show the code you are using. It will prevent people from giving you obvious answers (that aren't necessarily the problem) and if it is something obvious you are likely to get it identified quickly :) — Sunil D., Feb 17 '13 at 02:48
I have updated my question and added the code I am using. Thank you for your time. — rsantos, Feb 17 '13 at 04:55

score 0 · Answer 1 · answered Feb 18 '13 at 07:57

I haven't worked with PurePDF before, but I have used ByteArray to extract information from files. What exactly do you want to get from this PDF? Do you want to extract just text? Also, can you upload a link to the PDF? It will be easier to help if we are looking at the same thing.

About the Japanese text: when you read a PDF into a ByteArray, don't expect to find much human-readable text, because most of the data describes the file structure. The actual text and pictures from the PDF are stored inside objects called streams. So you usually locate a stream of text and extract it into your ByteArray. To display the text correctly, you then decode it with the encoding (UTF-8, UTF-16, etc.) indicated in the PDF data.

The link below explains PDF streams in more detail ("/Length" gives the number of bytes in the stream, and "/Filter" names the decoding filter, e.g. FlateDecode or ASCIIHexDecode, that the stream data must pass through before it is readable):

http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/

Anyway, all of this makes more sense if you open your PDF in a hex editor. Try the one below if you need one. You can then see where your streams are positioned and tell AS3 to extract from there:

http://www.hhdsoftware.com/free-hex-editor
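
Extracting from a stream position could look roughly like the sketch below. It is a shortcut under stated assumptions: it scans for the "stream" and "endstream" keywords instead of reading "/Length" (which is the authoritative size and may be an indirect reference), so it can be fooled by binary data that happens to contain those words, and the result may include a trailing end-of-line byte. The function names indexOfAscii and firstRawStream are hypothetical helpers, not part of PurePDF.

```actionscript
import flash.utils.ByteArray;

// Returns the byte offset of an ASCII token at or after 'from', or -1.
function indexOfAscii(data:ByteArray, token:String, from:uint = 0):int
{
    for (var i:uint = from; i + token.length <= data.length; i++)
    {
        var j:int = 0;
        while (j < token.length && data[i + j] == token.charCodeAt(j))
        {
            j++;
        }
        if (j == token.length)
        {
            return i;
        }
    }
    return -1;
}

// Copies the raw (still filtered) bytes of the first stream found.
function firstRawStream(pdf:ByteArray):ByteArray
{
    var start:int = indexOfAscii(pdf, "stream");
    if (start < 0)
    {
        return null;
    }
    start += 6;                           // skip the "stream" keyword
    if (pdf[start] == 0x0D) start++;      // optional CR
    if (pdf[start] == 0x0A) start++;      // LF before the data begins

    var end:int = indexOfAscii(pdf, "endstream", start);
    if (end < 0)
    {
        return null;
    }
    var raw:ByteArray = new ByteArray();
    raw.writeBytes(pdf, start, end - start);
    raw.position = 0;
    return raw;
}
```

If the stream's dictionary lists a filter such as FlateDecode, the bytes returned here are still compressed and will look like noise until that filter is undone.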

If there's still a problem, upload your PDF somewhere and say exactly what you're trying to extract from the document. I will try to give exact help for that (no promises, just trying to help). Peace.

Thank you for taking the time to answer. After some time I found out that PurePDF's PDF reading capabilities aren't fully implemented, which is possibly the reason I was getting "Japanese characters" and other errors. It is a port of Java's iText library, which I have tried and which has better data extraction capabilities. However, the PDF I need to read has a quite irregular layout (tables and columns), so the extracted data ended up broken. I am now saving the PDF as a .txt file and writing a parser for it, as the data is displayed in a quite complex way. Thanks again. — rsantos, Feb 19 '13 at 21:57
Just to add a comment maybe someone would find useful, I've been trying these last couple of hours to make PurePDF work, and no way... the reading functionality is broken. Tried from very simple pdfs to complex, all the same error. I even followed the suggestion from the google code page, (about changing the readInt() for readByte()) but other errors appeared. My suggestion, don't waste your time with it. — Artemix, Oct 06 '14 at 18:54
