Extracting embedded XML File from PDF A/3 using abcpdf in C# - ZUGFeRD

Question

I'm currently working with the new German ZUGFeRD files. These are PDF/A-3 files which have an embedded XML file in them that contains the invoice data.

I want to extract this XML file from the PDF/A-3 using abcpdf 8.1 with C#.

Any idea how to do this?

Thanks a lot and regards,

I'd get a library which deals with PDF files, then load the attached XML with XDocument. – SKull, Feb 11 '14 at 12:08
My library for dealing with PDF files is abcpdf, version 8.1. My question is how to extract/read the XML which is inside the PDF with this library (abcpdf). Thanks anyway! :) – user3296596, Feb 11 '14 at 12:19

Vad1mo · Accepted Answer · 2016-04-01T14:37:43.793

3

I don't know abcpdf, but I would guess that most PDF libraries offer similar access to a PDF's content.

First, take a look at Das-ZUGFeRD-Format_1p0.pdf, especially page 112. The image there shows the object tree you have to traverse in order to find the XML stream.

The tree gives you the names, the types and the direction of traversal. With that, you can walk the PDF object tree to get to the XML content that you are looking for.

The steps, based on the diagram:

1. Read your PDF
2. Get the catalog inside your PDF
3. Get the array named AF from the catalog
4. Get the first element from the AF array (should be a file spec)
5. From the file spec, get the dictionary named EF
6. Get the content of the embedded file stream that the EF dictionary points to (its F entry)

These are the steps you need to perform in order to get to the content.

To display the structure of a PDF and browse the tree, I would recommend a tool like iText RUPS.
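
The steps above can be sketched in C# without tying them to a particular PDF library. The following is only an illustration under stated assumptions: the PDF object tree is modelled with plain .NET collections (dictionaries as `Dictionary<string, object>`, arrays as `List<object>`, streams as `byte[]`), and the names `ZugferdTraversalSketch`, `Require` and `FindFirstAssociatedFile` are hypothetical. A real library such as abcpdf exposes its own types for these objects.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a PDF object tree; not an abcpdf API.
public static class ZugferdTraversalSketch
{
    // Returns dict[key] cast to T, or throws if the entry is missing
    // or has a different type.
    private static T Require<T>(Dictionary<string, object> dict, string key)
        where T : class
    {
        object value;
        if (dict.TryGetValue(key, out value) && value is T)
            return (T)value;
        throw new InvalidOperationException("Missing or unexpected entry /" + key);
    }

    // Walks catalog -> /AF -> first file spec -> /EF -> /F and returns
    // the bytes of the embedded file stream.
    public static byte[] FindFirstAssociatedFile(Dictionary<string, object> catalog)
    {
        // Step 3: the AF array in the catalog
        List<object> af = Require<List<object>>(catalog, "AF");
        if (af.Count == 0)
            throw new InvalidOperationException("/AF array is empty");

        // Step 4: the first element should be a file specification
        var fileSpec = af[0] as Dictionary<string, object>;
        if (fileSpec == null)
            throw new InvalidOperationException("First /AF element is not a file spec");

        // Step 5: the EF dictionary of the file spec
        var ef = Require<Dictionary<string, object>>(fileSpec, "EF");

        // Step 6: the embedded file stream referenced by the F entry
        return Require<byte[]>(ef, "F");
    }
}
```

In a real PDF the entries are usually indirect references that the library has to resolve, and the stream is typically compressed, so it has to be decoded before the bytes can be read as XML.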

edited Apr 01 '16 at 14:37

answered Feb 11 '14 at 16:27


Thanks a lot, this is pretty much the way to go! What I did with abcpdf: (1) get the ObjectSoup array from the Doc (pretty much an array of all objects in the Doc); (2) as ZUGFeRD allows only one embedded file inside the PDF, search this ObjectSoup array for the one object of type StreamObject that contains /EmbeddedFile; (3) decompress the stream of that object, get the byte[] of the stream and write it into an XML file. – user3296596 Feb 12 '14 at 10:34

@user3296596 *ZUGFeRD allows only one embedded file inside the PDF* - That's wrong: *only one invoice but any number of other attachments,* "only the inclusion of a single invoice data document ... the embedding of further documents and files that contain no invoice data is not affected by this". Furthermore this is subject to change, "In future versions of the ZUGFeRD standard this restriction may be lifted", cf. the PDF-Implementierungsguide-ZUGFeRD.pdf from the info package. – mkl Feb 13 '14 at 08:14
@Vadimo I think it is not a good idea to remove the reference to http://www.ferd-net.de completely and only link to a copy of the specification on the site of an implementation of ZUGFeRD, http://konik.io; admittedly, though, wikipedia does reference those copies, too. – mkl Apr 03 '16 at 16:40

score -2 · Answer 2 · answered Feb 12 '14 at 10:39


What I did with abcpdf:

1. Get the ObjectSoup array from the Doc (pretty much an array of all objects in the Doc)
2. As ZUGFeRD allows only one embedded file inside the PDF, I just searched this ObjectSoup array for the one object of type StreamObject that contains /EmbeddedFile
3. Decompress the stream of that object, get the byte[] of the stream and write it into an XML file
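
The last step can be sketched with standard .NET calls only. This is a minimal sketch, assuming the decompressed bytes of the embedded file stream have already been obtained from abcpdf as described above; the abcpdf calls themselves are deliberately not shown. The names `EmbeddedXmlSketch` and `SaveAndParse` are hypothetical. Besides writing the file, it loads the XML with `XDocument`, as suggested in the comment on the question.

```csharp
using System.IO;
using System.Xml.Linq;

// Hypothetical helper; 'xmlBytes' is assumed to be the already
// decompressed content of the embedded file stream.
public static class EmbeddedXmlSketch
{
    public static XDocument SaveAndParse(byte[] xmlBytes, string outputPath)
    {
        // Write the raw bytes unchanged so the encoding declared in the
        // XML prolog stays valid.
        File.WriteAllBytes(outputPath, xmlBytes);

        // Parse from memory; XDocument.Load detects the encoding itself
        // and throws XmlException if the bytes are not well-formed XML.
        using (var memory = new MemoryStream(xmlBytes))
        {
            return XDocument.Load(memory);
        }
    }
}
```

Parsing the bytes also acts as a cheap sanity check: if the stream that was picked is some other attachment rather than the invoice XML, `XDocument.Load` is likely to fail or return an unexpected root element.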

user3296596


*ZUGFeRD allows only one embedded file inside the PDF* - That's wrong: **only one invoice but any number of other attachments**, "only the inclusion of a single invoice data document ... the embedding of further documents and files that contain no invoice data is not affected by this". Furthermore this is subject to change, "In future versions of the ZUGFeRD standard this restriction may be lifted", cf. the PDF-Implementierungsguide-ZUGFeRD.pdf from the info package. – mkl Feb 12 '14 at 12:34
