How to extract embedded PDF from XML file?

Question

I'm stuck. I have an XML file that has a PDF file embedded under the FILE node. How do I extract that file? I have nothing to start with. I can read the values of "normal" nodes easily, but how do I turn the node's text into a binary PDF file?

Here is the relevant section of my XML:

...
<FILES>
<FILE datasetclassification="Not Defined" datasetdescription="" datasettype="PDF" datasetname="BG01119588_A_PDF_2" name="BG01119588_A_PDF_2.pdf">
 JVBERi0xLjcKJeTjz9IKNiAwIG9iago8PC9MZW5ndGggNyAwIFIvRmlsdGVyL0Zs YXRlRGVjb2RlPj4Kc3RyZWFtCnicAwAAAAABCmVuZHN0cmVhbQplbmRvYmoKNyAw IG9iago4CmVuZG9iago4IDAgb2JqCjw8L1N1YnR5cGUvSW1hZ2UvV2lkdGggNDg1 L0hlaWdodCAxNzcvQml0c1BlckNvbXBvbmVudCA4L0NvbG9yU3BhY2UvRGV2aWNl
...

I would like to get the PDF out of the XML.

You should check how the binary data is encoded. Then decode it and write it to disk. My guess is, it's base64 encoded. Take a look [here](https://stackoverflow.com/questions/19893/how-do-you-embed-binary-data-in-xml) — Jürgen Müller, Mar 28 '19 at 08:54
It must be a Base64 string. You need to get the inner text of the FILE tag and then use `Convert.FromBase64String(string)` — jdweng, Mar 28 '19 at 09:39
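
The suggestion above can be sketched in C# roughly as follows. This is a minimal sketch, not a confirmed solution: it assumes the FILE content really is Base64 (the sample's leading characters `JVBERi0xLjcK` do decode to `%PDF-1.7` plus a newline, which supports that guess), that the whole document fits in memory, and that the `name` attribute is a usable file name. The class and method names (`EmbeddedPdfExtractor`, `DecodeFiles`, `SaveAll`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class EmbeddedPdfExtractor
{
    // Yields (file name, decoded bytes) for every FILE element in the document.
    public static IEnumerable<KeyValuePair<string, byte[]>> DecodeFiles(XDocument doc)
    {
        // Match on the local name so the lookup also works if the XML
        // declares a default namespace (an assumption; the sample shows none).
        foreach (XElement file in doc.Descendants().Where(e => e.Name.LocalName == "FILE"))
        {
            // Fall back to a placeholder name if the attribute is missing.
            string name = (string)file.Attribute("name") ?? "output.pdf";

            // Convert.FromBase64String ignores spaces, tabs and line breaks,
            // so the wrapped text in the sample can be passed in unchanged.
            byte[] bytes = Convert.FromBase64String(file.Value);

            yield return new KeyValuePair<string, byte[]>(name, bytes);
        }
    }

    // Loads the XML file and writes each decoded FILE element into outputDir.
    public static void SaveAll(string xmlPath, string outputDir)
    {
        XDocument doc = XDocument.Load(xmlPath);
        foreach (KeyValuePair<string, byte[]> entry in DecodeFiles(doc))
        {
            // Path.GetFileName drops any directory part of the attribute value.
            string target = Path.Combine(outputDir, Path.GetFileName(entry.Key));
            File.WriteAllBytes(target, entry.Value);
        }
    }
}
```

Calling `EmbeddedPdfExtractor.SaveAll("input.xml", "out")` (both paths are placeholders) would write one file per FILE element; the output directory has to exist beforehand. If the decoded file does not start with `%PDF`, the content is probably encoded some other way.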
Assuming that's base64 (and it should be), if the PDF data is huge you can stream through the file using `XmlReader` and decode the PDFs using [`XmlReader.ReadElementContentAsBase64(Byte[] buffer, Int32 index, Int32 count)`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreader.readelementcontentasbase64?view=netframework-4.7.2). For details see [XmlReader - How to read very long string in element without System.OutOfMemoryException](https://stackoverflow.com/q/54126687/3744182). — dbc, Mar 28 '19 at 09:45
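
For large files, the streaming variant mentioned in the comment above might look like the sketch below. It assumes the same Base64 encoding and `name` attribute as in the question; the class name `StreamingPdfExtractor`, the method `ExtractAll` and the 8 KB buffer size are illustrative choices, not part of any library.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StreamingPdfExtractor
{
    // Streams through the XML and writes each FILE element's decoded content
    // to outputDir without loading the whole document into memory.
    // Returns the number of files written.
    public static int ExtractAll(string xmlPath, string outputDir)
    {
        int count = 0;
        byte[] buffer = new byte[8192];

        using (XmlReader reader = XmlReader.Create(xmlPath))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "FILE")
                {
                    // Attributes must be read before the content is consumed.
                    string name = reader.GetAttribute("name") ?? ("file" + count + ".pdf");
                    string target = Path.Combine(outputDir, Path.GetFileName(name));

                    using (FileStream output = File.Create(target))
                    {
                        int read;
                        // Each call decodes the next chunk; 0 means the element is finished.
                        while ((read = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
                        {
                            output.Write(buffer, 0, read);
                        }
                    }
                    count++;
                    // The reader now sits on the node after </FILE>,
                    // so Read() is deliberately not called here.
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return count;
    }
}
```

The loop avoids calling `Read()` right after a FILE element because `ReadElementContentAsBase64` has already advanced the reader past the end tag; an extra `Read()` there could skip an adjacent FILE element.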
