I ran into a similar requirement in my work. My best effort (intuitive, easy to implement, relatively performant) is as follows. I write with an XmlWriter while monitoring the underlying stream. When the stream surpasses my file size limit, I complete the current Xml fragment, save the file, and close the stream.
Then, on a second pass, I load the full DOM into memory and iteratively remove nodes and save the document until it is of acceptable size.
For example:
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// arbitrary limit of 10MB
long FileSizeLimit = 10 * 1024 * 1024;

// open file stream to monitor file size
using (FileStream file = new FileStream("some.data.xml", FileMode.Create))
using (XmlWriter writer = XmlWriter.Create(file))
{
    writer.WriteStartElement("root");

    // while not greater than FileSizeLimit.
    // NOTE: XmlWriter buffers its output, so file.Length can lag
    // behind what has been written; any overshoot is corrected
    // by the second pass below
    while (file.Length < FileSizeLimit)
    {
        // write contents
        writer.WriteElementString(
            "data",
            string.Format("{0}/{0}/{0}/{0}/{0}", Guid.NewGuid()));
    }

    // complete fragment; this is the trickiest part,
    // since a complex document may have an arbitrarily
    // long tail, and cannot be known during file size
    // sampling above
    writer.WriteEndElement();
    writer.Flush();
}

// iteratively reduce document size
// NOTE: XDocument will load full DOM into memory
XDocument document = XDocument.Load("some.data.xml");
XElement root = document.Element("root");

// stop early if there is nothing left to remove
while (root.LastNode != null &&
    new FileInfo("some.data.xml").Length > FileSizeLimit)
{
    root.LastNode.Remove();
    document.Save("some.data.xml");
}
There are ways to improve this. If memory is a constraint, one possibility is to rewrite the iterative bit to count the nodes actually written in the first pass, then rewrite the file with one element fewer, and continue until the full document is of the desired size.
This last recommendation may be the route to go, especially if you already need to track the elements written in order to resume writing in another file.
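Another possible refinement concerns the "long tail" mentioned in the code comments. The following is only a sketch, under the assumption that the tail consists of nothing but plain closing tags for the elements that are currently open. TailBudget and ClosingTagBytes are hypothetical names, not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TailBudget
{
    // Hypothetical helper: returns the number of UTF-8 bytes needed to
    // close every currently open element, assuming each one is closed
    // with a plain "</name>" tag and nothing else follows it.
    public static long ClosingTagBytes(IEnumerable<string> openElementNames)
    {
        long total = 0;
        foreach (string name in openElementNames)
        {
            // 3 extra bytes for the "</" and ">" characters
            total += Encoding.UTF8.GetByteCount(name) + 3;
        }
        return total;
    }
}
```

The first-pass loop condition could then become something like file.Length + TailBudget.ClosingTagBytes(new[] { "root" }) < FileSizeLimit. Prefixed elements would need their qualified names passed in, and the second pass is still needed because the writer buffers its output.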
Hope this helps!
EDIT
Although the approach above is intuitive and easier to implement, I felt it worth investigating the optimization mentioned above. This is what I got.
An extension method that helps write ancestor nodes (i.e. container nodes, and all other kinds of markup):
// performs a shallow copy of a given node. courtesy of Mark Fussell
// http://blogs.msdn.com/b/mfussell/archive/2005/02/12/371546.aspx
// NOTE: as an extension method, this must be declared inside a
// static class (requires using System.Xml)
public static void WriteShallowNode(this XmlWriter writer, XmlReader reader)
{
    switch (reader.NodeType)
    {
        case XmlNodeType.Element:
            writer.WriteStartElement(
                reader.Prefix,
                reader.LocalName,
                reader.NamespaceURI);
            writer.WriteAttributes(reader, true);
            if (reader.IsEmptyElement)
            {
                writer.WriteEndElement();
            }
            break;
        case XmlNodeType.Text: writer.WriteString(reader.Value); break;
        case XmlNodeType.Whitespace:
        case XmlNodeType.SignificantWhitespace:
            writer.WriteWhitespace(reader.Value);
            break;
        case XmlNodeType.CDATA: writer.WriteCData(reader.Value); break;
        case XmlNodeType.EntityReference:
            writer.WriteEntityRef(reader.Name);
            break;
        case XmlNodeType.XmlDeclaration:
        case XmlNodeType.ProcessingInstruction:
            writer.WriteProcessingInstruction(reader.Name, reader.Value);
            break;
        case XmlNodeType.DocumentType:
            writer.WriteDocType(
                reader.Name,
                reader.GetAttribute("PUBLIC"),
                reader.GetAttribute("SYSTEM"),
                reader.Value);
            break;
        case XmlNodeType.Comment: writer.WriteComment(reader.Value); break;
        case XmlNodeType.EndElement: writer.WriteFullEndElement(); break;
    }
}
And a method that performs the trimming (not an extension method, since extending any of the parameter types would be a bit ambiguous):
// trims xml file to specified file size. does so by
// counting number of "victim candidates" and then iteratively
// trimming these candidates one at a time until resultant
// file size is just less than desired limit. does not
// consider nested victim candidates.
public static void TrimXmlFile(string filename, long size, string trimNodeName)
{
    long fileSize = new FileInfo(filename).Length;
    long workNodeCount = 0;

    // count number of victim elements in xml
    if (fileSize > size)
    {
        using (XmlReader countReader = XmlReader.Create(filename))
        {
            bool hasNode = countReader.Read();
            while (hasNode)
            {
                if (countReader.NodeType == XmlNodeType.Element &&
                    countReader.Name == trimNodeName)
                {
                    workNodeCount++;
                    // Skip() already moves to the node after the
                    // victim, so do not Read() again here or an
                    // adjacent victim would be missed
                    countReader.Skip();
                    hasNode = !countReader.EOF;
                }
                else
                {
                    hasNode = countReader.Read();
                }
            }
        }
    }

    // if greater than desired file size, and there is at least
    // one victim candidate
    string workFilename = filename + ".work";
    for (;
        fileSize > size && workNodeCount > 0;
        fileSize = new FileInfo(filename).Length)
    {
        workNodeCount--;
        using (FileStream readFile = new FileStream(filename, FileMode.Open))
        using (FileStream writeFile = new FileStream(
            workFilename,
            FileMode.Create))
        using (XmlReader reader = XmlReader.Create(readFile))
        using (XmlWriter writer = XmlWriter.Create(writeFile))
        {
            long j = 0;
            bool hasAlreadyRead = false;
            while (hasAlreadyRead || reader.Read())
            {
                // if node is a victim node
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == trimNodeName)
                {
                    // if we have not surpassed this iteration's
                    // allowance, preserve node
                    if (j < workNodeCount)
                    {
                        writer.WriteNode(reader, true);
                    }
                    j++;

                    // if we have exceeded this iteration's
                    // allowance, trim node (and whitespace)
                    if (j >= workNodeCount)
                    {
                        reader.ReadToNextSibling(trimNodeName);
                    }

                    // WriteNode and ReadToNextSibling both leave the
                    // reader on a node that is not yet processed
                    hasAlreadyRead = true;
                }
                else
                {
                    // some other xml content we should preserve
                    writer.WriteShallowNode(reader);
                    hasAlreadyRead = false;
                }
            }
            writer.Flush();
        }
        File.Copy(workFilename, filename, true);
    }
    File.Delete(workFilename);
}
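The hasAlreadyRead flag above reflects a general XmlReader rule: calls such as Skip() and WriteNode() leave the reader already positioned on the following node, so calling Read() straight afterwards steps over that node. The sketch below shows the rule in isolation by counting non-nested matching elements in an XML string. VictimCounter and CountElements are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class VictimCounter
{
    // Hypothetical helper: counts elements named elementName, treating
    // each match as a unit (matches nested inside a match are not counted).
    public static long CountElements(string xml, string elementName)
    {
        long count = 0;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            bool hasNode = reader.Read();
            while (hasNode)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == elementName)
                {
                    count++;
                    // Skip() moves past the whole element and lands on
                    // the next node, so no extra Read() is needed
                    reader.Skip();
                    hasNode = !reader.EOF;
                }
                else
                {
                    hasNode = reader.Read();
                }
            }
        }
        return count;
    }
}
```

With this pattern the count is the same whether or not whitespace separates the matching elements, which matters for files produced by an XmlWriter with indentation turned off.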
If your Xml contains whitespace formatting, any whitespace between the last remaining victim node and the closing container element tag is lost. This can be mitigated by altering the skip clause (moving the j++ statement after the skip), but then you end up with additional whitespace. The solution presented above generates a replica of the source file with minimal file size.