I'm stuck. I have an XML file with a PDF file embedded under the FILE node. How do I extract the file? I don't know where to start. I can read the values of "normal" nodes easily, but how do I turn the text content into a binary PDF file?

Here is the relevant section of my XML:

...
<FILES>
<FILE datasetclassification="Not Defined" datasetdescription="" datasettype="PDF" datasetname="BG01119588_A_PDF_2" name="BG01119588_A_PDF_2.pdf">
 JVBERi0xLjcKJeTjz9IKNiAwIG9iago8PC9MZW5ndGggNyAwIFIvRmlsdGVyL0Zs YXRlRGVjb2RlPj4Kc3RyZWFtCnicAwAAAAABCmVuZHN0cmVhbQplbmRvYmoKNyAw IG9iago4CmVuZG9iago4IDAgb2JqCjw8L1N1YnR5cGUvSW1hZ2UvV2lkdGggNDg1 L0hlaWdodCAxNzcvQml0c1BlckNvbXBvbmVudCA4L0NvbG9yU3BhY2UvRGV2aWNl
...

I would like to get the PDF out of the XML.

Sigve Kolbeinson
MikkoR
  • You should check how the binary data is encoded. Then decode it and write it to disk. My guess is, it's base64 encoded. Take a look [here](https://stackoverflow.com/questions/19893/how-do-you-embed-binary-data-in-xml) – Jürgen Müller Mar 28 '19 at 08:54
  • It must be a base64 string. You need to get the inner text of the FILE element and then use `Convert.FromBase64String(string)`. – jdweng Mar 28 '19 at 09:39
  • Assuming that's base64 (and it should be), if the PDF data is huge you can stream through the file using `XmlReader` and decode the PDFs using [`XmlReader.ReadElementContentAsBase64(Byte[] buffer, Int32 index, Int32 count)`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreader.readelementcontentasbase64?view=netframework-4.7.2). For details see [XmlReader - How to read very long string in element without System.OutOfMemoryException](https://stackoverflow.com/q/54126687/3744182). – dbc Mar 28 '19 at 09:45
  • Ok, sounds reasonable. I will try. – MikkoR Mar 28 '19 at 20:57
  • Got it working! Thanks for help! – MikkoR Mar 29 '19 at 18:08
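
For anyone landing here later, the steps suggested in the comments can be sketched as follows. The payload in the question starts with `JVBERi0`, which is what the PDF header `%PDF-` looks like in base64, so the base64 guess looks right. This is only a minimal sketch in Python, not the asker's actual solution. In .NET the decode step would be the `Convert.FromBase64String` call mentioned above. The function name `extract_embedded_files` and the fallback file name `embedded.bin` are made up for illustration, and since the XML in the question is truncated, the code simply looks for `FILE` elements anywhere in the document.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_embedded_files(xml_path, out_dir):
    """Decode the base64 text of every <FILE> element and write it to out_dir.

    Hypothetical helper: the element name FILE and the "name" attribute come
    from the question; everything else about the document layout is assumed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for file_node in ET.parse(xml_path).getroot().iter("FILE"):
        # The payload in the question contains spaces and line breaks,
        # so remove all whitespace before decoding.
        payload = "".join((file_node.text or "").split())
        data = base64.b64decode(payload, validate=True)
        # Take the file name from the "name" attribute, dropping any
        # directory part; fall back to a placeholder if it is missing.
        name = Path(file_node.get("name") or "embedded.bin").name
        target = out_dir / name
        target.write_bytes(data)
        written.append(target)
    return written
```

Calling `extract_embedded_files("input.xml", "out")` (both paths are placeholders) returns the list of files it wrote. This loads the whole XML into memory, so for very large embedded PDFs the streaming approach dbc describes would be the better fit.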

0 Answers