I'm working with a massive XML file that is exported from Confluence to represent the current state of a given Confluence space. For those familiar with Confluence, this export is used to back up, restore, or migrate spaces within or across environments.
I'm trying to automate some basic analysis on the XML so I can output some useful information for determining if our export data is "OK" based on a set of rules we have defined.
Given the size of some of these exports and the structure of the XML it can be a pain and very time consuming to analyze this manually.
Essentially, I've whittled the XML down to an IEnumerable of "object" XElements.
var filename = "export.xml";
var currentDirectory = Directory.GetCurrentDirectory();
var confluenceExportFilePath = Path.Combine(currentDirectory, filename);
XDocument confluenceExport = XDocument.Load(confluenceExportFilePath);
var objects = confluenceExport.Descendants("object");
I've then taken that further and selected only the objects whose class attribute equals "Page", since Page objects are the only ones I care about. Up to this point I've returned some basic "header" information about each Page.
var pages =
    from page in objects
    where (string)page.Attribute("class") == "Page"
    select new Page
    {
        Id = (string)page.Element("id"),
        // Casting the attribute to string (instead of reading .Value) avoids a
        // NullReferenceException if a <property> has no "name" attribute.
        Title = (string)page.Elements("property").FirstOrDefault(property =>
            (string)property.Attribute("name") == "title"),
        Version = (int)page.Elements("property").FirstOrDefault(property =>
            (string)property.Attribute("name") == "version"),
    };
An example page "object" may look like this:
<object class="Page" package="com.atlassian.confluence.pages">
    <id name="id">001</id>
    <property name="title"><![CDATA[Test Page]]></property>
    <property name="lowerTitle"><![CDATA[test page]]></property>
    <property name="version">022</property>
    <property name="creationDate">2020-06-15 20:13:00.195</property>
    <property name="lastModificationDate">2020-06-18 12:01:04.482</property>
    <property name="versionComment"><![CDATA[]]></property>
    <collection name="bodyContents" class="java.util.Collection">
        <element class="BodyContent" package="com.atlassian.confluence.core">
            <id name="id">011</id>
        </element>
    </collection>
    <collection name="historicalVersions" class="java.util.Collection">
        <element class="Page" package="com.atlassian.confluence.pages">
            <id name="id">021</id>
        </element>
        <element class="Page" package="com.atlassian.confluence.pages">
            <id name="id">022</id>
        </element>
    </collection>
    <property name="contentStatus"><![CDATA[current]]></property>
    <collection name="attachments" class="java.util.Collection">
        <element class="Attachment" package="com.atlassian.confluence.pages">
            <id name="id">031</id>
        </element>
        <element class="Attachment" package="com.atlassian.confluence.pages">
            <id name="id">032</id>
        </element>
    </collection>
</object>
However, I want to dig a little deeper into the XML to get some more specific data, and I'm struggling to do that. For example, I would like to select the "id" value of the BodyContent element nested inside the bodyContents collection.
<collection name="bodyContents" class="java.util.Collection">
    <element class="BodyContent" package="com.atlassian.confluence.core">
        <id name="id">011</id>
    </element>
</collection>
Ultimately what I would like is to be able to output:
Page ID: 001
Page Title: Test Page
Page Version: 022
Page Body Content ID: 011
How can I go about getting this?
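One direction that might work is sketched below. It assumes every page has at most one BodyContent element that matters (only the first match is returned) and that the "bodyContents" collection is a direct child of the page object, as in the sample above. The class name PageBodyContentExample and the helper GetBodyContentId are hypothetical names made up for this illustration; they are not part of my existing code.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class PageBodyContentExample
{
    // Hypothetical helper: returns the <id> value of the first BodyContent
    // element inside the page's "bodyContents" collection, or null if the
    // collection, the element, or the id is missing.
    public static string GetBodyContentId(XElement page)
    {
        return page.Elements("collection")
            .Where(c => (string)c.Attribute("name") == "bodyContents")
            .Elements("element")
            .Where(e => (string)e.Attribute("class") == "BodyContent")
            .Elements("id")
            .Select(id => (string)id)
            .FirstOrDefault();
    }

    public static void Main()
    {
        // Trimmed-down copy of the sample page object from the question.
        var page = XElement.Parse(@"
            <object class=""Page"" package=""com.atlassian.confluence.pages"">
                <id name=""id"">001</id>
                <collection name=""bodyContents"" class=""java.util.Collection"">
                    <element class=""BodyContent"" package=""com.atlassian.confluence.core"">
                        <id name=""id"">011</id>
                    </element>
                </collection>
            </object>");

        Console.WriteLine($"Page Body Content ID: {GetBodyContentId(page)}");
        // prints "Page Body Content ID: 011"
    }
}
```

If this holds up, the same call could presumably be used inside the existing select, for example as BodyContentId = PageBodyContentExample.GetBodyContentId(page), assuming a string BodyContentId property is added to the Page class.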