
I have large XML files ("ONIX" standard) I'd like to split. Basic structure is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<!-- DOCTYPE is not always present and may look different -->
<ONIXmessage> <!-- sometimes with an attribute -->
<header>
...
</header> <!-- up to this line every output file should be identical to the source -->
<product> ... </product>
<product> ... </product>
...
<product> ... </product>
</ONIXmessage>

What I want to do is split this file into n smaller files of approximately the same size. To do this I'd count the number of <product> nodes, divide that count by n, and clone the products into n new XML files. I have searched a lot, and this task seems to be harder than I thought.

  1. What I could not solve so far is how to create a new XML document with an identical XML declaration, doctype, root element and <header> node, but without the <product>s. I could do this using regex, but I'd rather use XML tools.
  2. What would be the smartest way to transfer a number of <product> nodes to a new XML document: object notation, like $xml.ONIXmessage.product | % { copy... }, XPath queries (can you select n nodes with XPath?) combined with CloneNode(), or XmlReader/XmlWriter?
  3. The content of the nodes should be identical regarding formatting and encoding. How can this be ensured?

I'd be very grateful for some nudges in the right direction!

Beeblebrox

2 Answers


One way is to:

  1. Make copies of the xml-file
  2. Remove all product nodes in the copies
  3. Use a loop to copy one product at a time from the original file to one of the copies.
  4. When you reach your product-per-file limit, save the current file (copy) and create a new file.

Example:

param($path, [int]$maxitems)

$file = Get-ChildItem $path

################

#Read file
$xml = [xml](Get-Content -Path $file.FullName | Out-String)
$product = $xml.SelectSingleNode("//product")
$parent = $product.ParentNode

#Create copy-template
$copyxml = [xml]$xml.OuterXml
$copyproduct = $copyxml.SelectSingleNode("//product")
$copyparent = $copyproduct.ParentNode
#Remove all but one product (to know where to insert new ones)
$copyparent.SelectNodes("product") | Where-Object { $_ -ne $copyproduct } | ForEach-Object { $copyparent.RemoveChild($_) } > $null

$allproducts = @($parent.SelectNodes("product"))
$totalproducts = $allproducts.Count

$fileid = 1
$i = 0

foreach ($p in $allproducts) {
    #If at the beginning, or after a full file was saved, create a new output document from the template
    if($i % $maxitems -eq 0) {
        #Create copy of file
        $newFile = [xml]($copyxml.OuterXml)
        #Get parentnode
        $newparent = $newFile.SelectSingleNode("//product").ParentNode
        #Remove all products
        $newparent.SelectNodes("product") | ForEach-Object { $newparent.RemoveChild($_) } > $null
    }

    #Copy the product node into the new document
    $cur = $newFile.ImportNode($p,$true)
    $newparent.AppendChild($cur) > $null

    #Add 1 to "items moved"
    $i++ 

    #If the file is full, or this was the last product, save it
    if(($i % $maxitems -eq 0) -or ($i -eq $totalproducts)) {
        $newfilename = $file.FullName.Replace($file.Extension,"$fileid$($file.Extension)")
        $newFile.Save($newfilename)
        $fileid++
    }

}
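
If the script above is saved to a file, it can be invoked with the source path and the number of products per output file, for example (the script name Split-Onix.ps1 is just a placeholder):

#Hypothetical invocation; adjust the script name and paths to your environment.
#Splits onix.xml into onix1.xml, onix2.xml, ... with at most 1000 <product> nodes each.
.\Split-Onix.ps1 -path .\onix.xml -maxitems 1000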

UPDATE: Since performance was important here, I created a new version of the script that uses a foreach loop and an XML template for the copies, removing about 99% of the read and delete operations. The concept is still the same, but it's executed in a different way.

Benchmark:

10 items, 3 per xml, OLD solution: 0.0448831 seconds
10 items, 3 per xml, NEW solution: 0.0138742 seconds
16001 items, 1000 per xml, OLD solution: 73.1934346 seconds
16001 items, 1000 per xml, NEW solution: 5.337443 seconds
Frode F.
  • Your code worked quite well - thanks a lot for that. I had to replace `SelectNodes("product")` with `SelectNodes("//product")` (same for `SelectSingleNode()`). What I could not solve so far: `@($parent.product).Count` gives `1` when no `<product>` is left. I tried `($parent.product).Count` but that results in `NULL` when one `<product>` is left. What could be a reliable way to get the count of nodes? – Beeblebrox May 06 '16 at 21:30
  • The code worked with your sample in PowerShell 5.0 (should work with 3.0+ as far as I can see). The point of `selectnodes("product")` is that I'm using it on the parent-node itself, so it shouldn't be necessary with "//" as long as the products have the same parent (which your sample had). "//" should only be necessary to detect the products in the first place (had to use it since DOCTYPE was "optional"). As for the product count, I guess you can use `@($parent.SelectNodes("product")).Count`. – Frode F. May 06 '16 at 21:36
  • Now I tried a file with 12,000 `<product>`s (about 80 MB). Processing took about 5 min which is too slow for me. Additionally the transfer of Unicode characters in [CDATA] sections did not work correctly. I opened both files in Notepad++. The source file seems to have a BOM, the output does not. Maybe that's the reason why Unicode chars are displayed as 2 characters in the output. Looks like I have to develop a text-only version with regex... – Beeblebrox May 06 '16 at 22:34
  • Text-parsing is always faster, but requires more manual work. See the update for a faster version with xml-objects. You still need to handle encoding, tests (path not existing, no products in xml ++) etc. As for the encoding, you may need to write using a textwriter/stream or specify the encoding when you read the file. `$newfile.Save(string filename)` writes UTF8 as far as I know (which you state in your xml declaration that you want), but it sounds like you're getting UTF16 or something else. We don't have the original data, so this is a problem you need to figure out (you have the data). – Frode F. May 07 '16 at 09:42
  • It's important to remember StackOverflow is a free service. We are not here to do anyone's work for free, but we try to help current and future readers solve specific problems. As a result, many answers here are proof-of-concepts, like my solution above, used to showcase a concept/idea. They will usually require some modification or optimization before being used in a production environment. – Frode F. May 07 '16 at 10:22
  • I feel sad that you seem to interpret my comments as complaints. Quite the contrary, I was amazed at the promptness and quality of your replies. I'm still a PS noob, and maybe my frustration about hours of research with no satisfying results came through in my comments. I never would expect anyone to "do my work for free". I'm very grateful for Stack Overflow, which has helped me more than once. I assure you I'm giving back some of that in my areas of expertise and sincerely hope this thread will be useful for future readers. If my comments caused you any inconvenience, please feel free to ignore them. – Beeblebrox May 08 '16 at 13:44
  • No problem, I'm glad to help. It was just a reminder for everyone who reads that we expect others to debug and modify a bit on their own before asking. The questions and answers here need to be reusable for others, so I don't want to make the answer too specific to your situation. The answer itself answers the question and sample data you've provided. If you have a problem with the solution because your real data is more complex, then the question itself should be updated with more realistic sample data. :-) or if the problem is more general, a new question. – Frode F. May 08 '16 at 13:58
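
One thing hinted at in the comments above, sketched here only as an untested idea: instead of calling $newFile.Save($newfilename) directly, the document could be saved through an XmlWriter whose settings specify UTF-8 with a byte order mark, which should match the BOM the source file apparently has.

#Sketch only: would replace the plain $newFile.Save($newfilename) call in the script above.
#New-Object System.Text.UTF8Encoding($true) yields UTF-8 that emits a BOM when writing.
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Encoding = New-Object System.Text.UTF8Encoding($true)
$writer = [System.Xml.XmlWriter]::Create($newfilename, $settings)
$newFile.Save($writer)
$writer.Close()

To keep the formatting of the nodes intact, it may also be necessary to load the source with PreserveWhitespace enabled (creating an XmlDocument and calling Load() instead of using the [xml] cast), but that is untested here.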

Just throwing a thought out there for you to consider; it is neither tested nor complete:

Import the XML into an array. Divide the array.count by n, and then loop through the array exporting to new XML files. You might have to create n arrays before exporting.

e.g.: use the Import-Clixml and Export-Clixml cmdlets.

Presuming that all of the XML nodes are the same object type.
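
A rough, untested sketch of the chunking step only, assuming $items already holds the imported objects and $n is the desired number of output files (the output file names are placeholders):

#Split the array into $n roughly equal chunks and export each one.
$chunkSize = [int][math]::Ceiling($items.Count / $n)
for ($i = 0; $i -lt $n; $i++) {
    $chunk = $items | Select-Object -Skip ($i * $chunkSize) -First $chunkSize
    if ($chunk) {
        #Note: Export-Clixml writes PowerShell's CLIXML serialization format, not the original XML markup.
        $chunk | Export-Clixml -Path ("products{0}.xml" -f ($i + 1))
    }
}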

Arluin
  • Interesting approach. Unfortunately the content of the `<product>`s can vary considerably. And I still would not know how to copy the header. – Beeblebrox May 05 '16 at 21:39
  • Without a copy of your XML it's hard to determine. However, your XML file seems to contain "only" products, and if you read them into an array using Import-Clixml, what you will get is an array of Product objects. Each one can have different attributes. Then when you use Export-Clixml it will create new XML nodes from the array objects with the appropriate attributes. – Arluin May 06 '16 at 17:50
  • Have you tried the CliXML cmdlets with the sample? CliXML != XML. It's a special format for exporting PowerShell objects, and it will fail if you try to import his sample. – Frode F. May 07 '16 at 00:02