Extract Embedded XML from PDF with iTextSharp (C#)

Question

I need to extract XML data embedded in bankruptcy court files with C#. In a PDF reader the file looks like a typical court document. In Notepad the XML is buried in the text. I've tried extracting the text with this and another code snippet using SimpleTextExtractionStrategy. The first results in a file with no identifiable text from the PDF, and the second outputs symbols. I also tried accessing it as an AcroField and as an XFA form. Based on the Watch window, it doesn't seem to be either of those.

Stepping through the code in Visual Studio, the XML shows up under PDFReader >> Catalog >> Keys >> Raw >> Non-Public Members >> dictionary in the Watch window. I have no idea how to get to it, though. Since it's listed with other PDFNames in Watch, I thought I might be able to access it via PDFReader.Catalog.GetAsDict, but it doesn't display as a PDFName. The provider of these files has a Java app that seems to just read the text. I'm not sure whether I need to use a different extraction strategy or directly access the catalog item containing the XML. I've never worked with PDF files or iTextSharp programmatically, so I'm struggling. Any code suggestions?

score 4 · Accepted Answer · edited May 23 '17 at 11:44

It would help if you could share a PDF with an embedded XML. When I first read your question, I assumed that the XML would have been added as a document-level attachment (stored in EmbeddedFiles) or as an attachment annotation (stored in the Annots of a page dictionary).

Reading what is written on uscourts.gov, it looks as if the XML is actually an XMP stream. That would mean that you can find it in the Metadata entry of the Catalog (or maybe in a page dictionary).

If you cannot share the file, you will have to help yourself. You can do this by downloading iText RUPS. It is a free tool to look inside a PDF.

Browse the tree structure and look for Metadata, look for EmbeddedFiles, look for Annots. If you don't tell us how the XML is embedded, nobody will be able to help you.
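If RUPS is not at hand, a similar first look can be taken from code. The following is only a rough sketch, assuming iTextSharp 5.x; the class name and the command-line argument are made up for illustration. It lists the keys of the Catalog, prints the XMP metadata if there is any, and reports whether a Names > EmbeddedFiles entry exists. It does not look at page-level Annots.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper for a first look at where data might be hiding in a PDF.
class CatalogInspector
{
    static void Main(string[] args)
    {
        // args[0]: path to the PDF to inspect (assumption: passed on the command line).
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            PdfDictionary catalog = reader.Catalog;

            // List every key in the Catalog; non-standard (custom) entries show up here too.
            foreach (PdfName key in catalog.Keys)
            {
                Console.WriteLine(key.ToString());
            }

            // An XMP stream, if present, is referenced from the Metadata entry of the Catalog.
            byte[] xmp = reader.Metadata;
            Console.WriteLine(xmp == null
                ? "No XMP metadata found."
                : Encoding.UTF8.GetString(xmp));

            // Document-level attachments are stored under Names > EmbeddedFiles.
            PdfDictionary names = catalog.GetAsDict(PdfName.NAMES);
            bool hasEmbeddedFiles = names != null && names.Contains(PdfName.EMBEDDEDFILES);
            Console.WriteLine("EmbeddedFiles present: " + hasEmbeddedFiles);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Any key in the printed list that is not a standard Catalog entry is worth inspecting more closely.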

See my answer to the following question for an example: How to delete attachment of PDF using itext (look at how I use RUPS to look at the Catalog > Names > EmbeddedFiles).

Extra notes: the code you've tried so far is about extracting text from a page, NOT about extracting an XML file that is embedded inside a PDF.

Update:

Now that you've shared a file, I've used RUPS to find the XML file. Take a look at the following screen shot:

Screen shot

Do you see what happened here? Somebody added a custom entry named /USCTbankruptcynotice with a String as value straight to the catalog. That is so wrong: it is such a bad idea to store a file inside a string. Why didn't that developer store that file as a stream? I feel so sad for the person who employs such a developer.

This being said, this is how you can extract the XML:

PdfDictionary catalog = reader.Catalog;
// The custom catalog key that holds the XML as a PDF string
PdfName name = new PdfName("USCTbankruptcynotice");
PdfString USCTbankruptcynotice = catalog.GetAsString(name);
string xml = USCTbankruptcynotice.ToString();

This is written from memory. Please update my answer if you need to apply small corrections.
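For completeness, here is a fuller sketch of the same idea as a small console program, again assuming iTextSharp 5.x. The class name and the two command-line paths are hypothetical. It uses ToUnicodeString() rather than ToString() on the assumption that the string may carry Unicode text; whether that matters depends on how the XML was encoded in the file.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;

// Hypothetical console program: reads the custom catalog entry and saves it as XML.
class ExtractCatalogXml
{
    static void Main(string[] args)
    {
        string pdfPath = args[0]; // input PDF (assumption: passed on the command line)
        string xmlPath = args[1]; // output file for the extracted XML

        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // The custom key found in the Catalog with RUPS.
            PdfName key = new PdfName("USCTbankruptcynotice");

            // GetAsString returns null if the entry is missing or is not a PDF string.
            PdfString value = reader.Catalog.GetAsString(key);
            if (value == null)
            {
                Console.Error.WriteLine("Catalog has no string entry named " + key);
                return;
            }

            // Decode the PDF string to a .NET string and write it to disk.
            string xml = value.ToUnicodeString();
            File.WriteAllText(xmlPath, xml);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The null check is there because other PDFs will not have this custom entry, and GetAsString then returns null instead of throwing.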

answered Feb 03 '15 at 18:22

Bruno Lowagie

Doesn't look like there's a way to attach a file. There's a link to it from the court page - http://ebn.uscourts.gov/documents/yournotice.pdf. – cacosta Feb 03 '15 at 18:59
Sorry for the remarks about the strange way the XML is stored inside that PDF, but I am very passionate about PDF and I sometimes get carried away. You deserve an upvote for the inconvenience of being confronted with such PDF files. – Bruno Lowagie Feb 03 '15 at 19:16
No problem. And as long as I'm not the developer in question.... :) It's the US court system so you have to adjust your expectations. I've been cursing these files all morning. That code worked perfectly. Thank you so much! I just wasn't grasping how to access the PdfName – cacosta Feb 03 '15 at 19:24
Yes, and without looking inside the PDF file, nobody would have guessed. The US courts just invented a custom name... – Bruno Lowagie Feb 03 '15 at 19:26
Hi @BrunoLowagie, this is very valuable. But I am trying to extract the XML file using jQuery/JavaScript/Angular/Android or any hybrid mobile app technology. Is there any way to do so? Please help. The PDF is generated by iText. – Ananta Prasad Jun 29 '16 at 05:48
@AnantaPrasadLoda Don't post new questions in a comment to a question that is more than 1 year old. Nobody is going to answer that. Post a new question. – Bruno Lowagie Jun 29 '16 at 05:53
Thanks @BrunoLowagie, I will create a question. Can you help me on this ? – Ananta Prasad Jun 29 '16 at 05:58
@AnantaPrasadLoda No, I can't. Your question is too broad. It mixes server-side and client-side technology. If you're asking for a jQuery answer, I think the question is absurd. If you're asking for a JavaScript (as in ECMAScript) answer, the question is unanswerable. And so on. The question in its current state is a bad question. – Bruno Lowagie Jun 29 '16 at 06:00
OK Thanks @BrunoLowagie – Ananta Prasad Jun 29 '16 at 06:02
