
I'm working with a massive XML file that is exported from Confluence to represent the current state of a given Confluence space. For those familiar with Confluence this is used for backing up and restoring or migrating Confluence spaces in or across environments.

I'm trying to automate some basic analysis on the XML so I can output some useful information for determining if our export data is "OK" based on a set of rules we have defined.

Given the size of some of these exports and the structure of the XML it can be a pain and very time consuming to analyze this manually.

Essentially I've whittled down the XML to an IEnumerable of "object" XElements.

var filename = "export.xml";
var currentDirectory = Directory.GetCurrentDirectory();
var confluenceExportFilePath = Path.Combine(currentDirectory, filename);
XDocument confluenceExport = XDocument.Load(confluenceExportFilePath);
var objects = confluenceExport.Descendants("object");

Then I've taken that further and only selected objects that contain a class attribute equal to "Page" as I only care about the "objects" that are Page "objects". Up to this point I've returned some basic "header" information about each Page.

var pages =
    from page in objects
    where (string)page.Attribute("class") == "Page"
    select new Page
    {
        Id = (string)page.Element("id"),
        // Casting the attribute to string avoids a NullReferenceException
        // when a <property> has no name attribute.
        Title = (string)page.Elements("property").FirstOrDefault(property =>
            (string)property.Attribute("name") == "title"),
        Version = (int)page.Elements("property").FirstOrDefault(property =>
            (string)property.Attribute("name") == "version"),
    };

An example page "object" may look like this:

<object class="Page" package="com.atlassian.confluence.pages">
    <id name="id">001</id>
    <property name="title"><![CDATA[Test Page]]></property>
    <property name="lowerTitle"><![CDATA[test page]]></property>
    <property name="version">022</property>
    <property name="creationDate">2020-06-15 20:13:00.195</property>
    <property name="lastModificationDate">2020-06-18 12:01:04.482</property>
    <property name="versionComment"><![CDATA[]]></property>
    <collection name="bodyContents" class="java.util.Collection">
        <element class="BodyContent" package="com.atlassian.confluence.core">
            <id name="id">011</id>
        </element>
    </collection>
    <collection name="historicalVersions" class="java.util.Collection">
        <element class="Page" package="com.atlassian.confluence.pages">
            <id name="id">021</id>
        </element>
        <element class="Page" package="com.atlassian.confluence.pages">
            <id name="id">022</id>
        </element>
    </collection>
    <property name="contentStatus"><![CDATA[current]]></property>
    <collection name="attachments" class="java.util.Collection">
        <element class="Attachment" package="com.atlassian.confluence.pages">
            <id name="id">031</id>
        </element>
        <element class="Attachment" package="com.atlassian.confluence.pages">
            <id name="id">032</id>
        </element>
    </collection>
</object>

However, I want to dig a little deeper into the XML to get more specific data, and I'm struggling to do that. For example, I would like to select the "id" value nested inside the "bodyContents" collection.

    <collection name="bodyContents" class="java.util.Collection">
        <element class="BodyContent" package="com.atlassian.confluence.core">
            <id name="id">011</id>
        </element>
    </collection>

Ultimately what I would like is to be able to output:

Page ID: 001
Page Title: Test Page
Page Version: 022
Page Body Content ID: 011

How can I go about getting this?

bacis09
  • I'm pretty sure Confluence pages' data can be exported to JSON. It would probably be easier for you. –  Feb 09 '22 at 22:02
  • I believe you are correct, but in this situation I have no control over what format the export data is being handed to us. In some cases we may receive JSON, but we have to be able to handle XML. – bacis09 Feb 09 '22 at 22:09

2 Answers


The code below looks for the first descendant element whose class attribute is BodyContent and takes the value of its id child element. For the XML in your example, these search criteria suffice.

var pages =
    from page in objects
    where (string)page.Attribute("class") == "Page"
    select new Page
    {
        BodyContentId =
            (string)page
                .Descendants("element")
                .FirstOrDefault(o => (string)o.Attribute("class") == "BodyContent")
                ?.Element("id"),

        // Other properties
    };
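
As a variation on the query above, here is a minimal sketch that scopes the lookup to the bodyContents collection by name and formats the four lines asked for in the question. The class name BodyContentDemo and the method Describe are hypothetical, and every value is read as a string so that leading zeros such as "022" are preserved (an int Version would print 22).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class BodyContentDemo
{
    // Hypothetical helper: builds the report for a single <object class="Page"> element.
    public static string Describe(XElement page)
    {
        // Value of the <property> child with the given name attribute, or null if absent.
        string Prop(string name) =>
            (string)page.Elements("property")
                .FirstOrDefault(p => (string)p.Attribute("name") == name);

        // First <id> under <collection name="bodyContents">/<element>, or null if absent.
        var bodyContentId = (string)page.Elements("collection")
            .Where(c => (string)c.Attribute("name") == "bodyContents")
            .Elements("element")
            .Elements("id")
            .FirstOrDefault();

        return string.Join("\n",
            "Page ID: " + (string)page.Element("id"),
            "Page Title: " + Prop("title"),
            "Page Version: " + Prop("version"),
            "Page Body Content ID: " + bodyContentId);
    }
}
```

Calling Console.WriteLine(BodyContentDemo.Describe(page)) for each element selected by the where clause above should print one report per page.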

See also this post about how to handle large XML files.
In short, use an XmlReader to loop over the <object class="Page"> elements and load only a single page at a time into an XElement/XDocument, then apply the LINQ statements above to it.
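
The streaming approach can be sketched roughly as follows. This is an assumption-laden outline rather than the linked post's code: PageStreamer and StreamPages are hypothetical names, and it assumes the Page objects are not nested inside other elements named object.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class PageStreamer
{
    // Lazily yields one XElement per <object class="Page">, so only a
    // single page is materialized in memory at any time.
    public static IEnumerable<XElement> StreamPages(TextReader input)
    {
        using (var reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "object"
                    && reader.GetAttribute("class") == "Page")
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node that follows it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

A caller could open the export with File.OpenText("export.xml") inside a using statement, iterate over PageStreamer.StreamPages(file), and run the same LINQ to XML expressions on each yielded page.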

pfx

If you want to dig deep, then you can directly use XPath to retrieve the required values.

Code snippet:

// Requires: using System.Xml.XPath;
var docNav = new XPathDocument(FILE_PATH); // FILE_PATH: path to the export file
var navigator = docNav.CreateNavigator();
// Select only Page objects; "//object" alone would also match objects of every other class.
var nodeIterator = navigator.Select("//object[@class='Page']");
while (nodeIterator.MoveNext())
{
    Console.WriteLine("Page ID: {0}", nodeIterator.Current.SelectSingleNode("id")?.Value);
    Console.WriteLine("Page Title: {0}", nodeIterator.Current.SelectSingleNode("property[@name='title']")?.Value);
    Console.WriteLine("Page Version: {0}", nodeIterator.Current.SelectSingleNode("property[@name='version']")?.Value);
    Console.WriteLine("Page Body Content ID: {0}", nodeIterator.Current.SelectSingleNode("collection[@name='bodyContents']//id")?.Value);
}
codeninja.sj