2

I'm trying to determine the most efficient way of parsing .svclog files. To give you more context, the .svclog files I'm dealing with look like what's in http://msdn.microsoft.com/en-us/library/aa751795.aspx. The tracing logic creates <E2ETraceEvent/> elements and writes them all onto one single line in a .svclog file, so you end up with tens of megabytes' worth of single-line XML, such as:

<E2ETraceEvent [...]</E2ETraceEvent><E2ETraceEvent [...] </E2ETraceEvent>...

What is the most efficient way of reading one <E2ETraceEvent/> element at a time from this giant line? I know there are tools out there that can indent the XML for you and save the result either into the same file or into a separate file altogether. That's an additional step I'd rather skip, since performance will be quite important given the number of these files I might have to process. I don't want to have to indent a hundred fat files before I can even start processing them.

I could load the entire file in memory and treat it as a string (they're capped at 30 megs in my case), but I'm envisioning implementing some kind of "log-merging" logic in the future where I might need to stitch together hundreds of these files, and so loading them all in memory at once is just not going to happen.

I could probably use a regex such as "<E2ETraceEvent.*?</E2ETraceEvent>" and advance one element at a time (is that efficient at all?). I could also manually implement a state machine that reads one character at a time, but that already sounds bad :)
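Here's a rough sketch of the regex option, assuming the file fits in memory (which the 30 MB cap would allow); ProcessEvent is just a placeholder for whatever per-element handling I'd end up doing:

using System;
using System.IO;
using System.Text.RegularExpressions;

class RegexSketch
{
    static void Main(string[] args)
    {
        // Read the whole file (acceptable while files are capped at ~30 MB)
        var xml = File.ReadAllText(args[0]);

        // Lazily match each complete event; Singleline is harmless here since
        // everything is on one line anyway
        var pattern = new Regex("<E2ETraceEvent.*?</E2ETraceEvent>", RegexOptions.Singleline);

        foreach (Match m in pattern.Matches(xml))
        {
            ProcessEvent(m.Value);
        }
    }

    static void ProcessEvent(string eventXml)
    {
        // Placeholder: parse the single element, pull out the timestamp, etc.
        Console.WriteLine(eventXml.Length);
    }
}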

Tons of options, but I'm looking for something truly clean and elegant.

PS. Is it really common to deal with single-line files in parsing? I haven't done too much parsing work before, but almost all of the tools I've worked with seem to rely on reading x amount of lines at a time. All of that becomes completely useless the instant you don't have a single newline in the entire file.

  • A proper XML parser won't care whether or not there are linebreaks. Performance should be the same. – porges Mar 15 '12 at 00:01
  • The xml parsers I used complain about there being "multiple roots" in the XML file. For example, in Powershell, [xml]$xmlFile = get-content myfile.svclog will throw the following: Cannot convert value to type "System.Xml.XmlDocument". Error: "There are multiple root elements. Line 1, position 632." – Alexandr Kurilin Mar 15 '12 at 00:07
  • Ah, so what you've shown in your example is *exactly* what you're getting. When it's like that it isn't actually valid XML, as there can only be one top-level element ("root") in XML. You'll have to wrap it in a containing element, something like `...` – porges Mar 15 '12 at 00:11
  • That actually works great, however it has downsides: 1) I have to modify the files. If I have read-only access to that folder, I'd have to make a local copy of all of the contents over the network (could be gigabytes of data) just so I can add an extra tag into the file. 2) I expect the tag addition step (create file with open tag, append the xml file, append closing tag) to take a long time for large files. – Alexandr Kurilin Mar 15 '12 at 01:29
  • For something like a log merger tool, I will never have to read more than one element at a time per file, since I'm processing them in order they were created, and that order is already correct for each file. As long as I'm able to parse one element per file (this is the painful part I'm trying to solve) I can get the date and sort each of the elements that are currently available to me by that date, and perform merging by appending the earliest element to the merged file I'd be building. – Alexandr Kurilin Mar 15 '12 at 01:33
  • @glich, is your software built in .NET? How bad a hit is it to read the file into a string, apply the beginning and end tags to ensure a single root, *then* parse the string using an XML parser? – devuxer Mar 15 '12 at 01:33
  • @DanM, yes it's .NET. I actually thought of what you proposed, but if I have tons of these log files I might want to merge, I'll have to have each one of them in memory at once, which will make the tool scale very poorly. I'm trying to keep the footprint low if it's avoidable. – Alexandr Kurilin Mar 15 '12 at 01:37
  • @Porges: in any case I agree with you that the way my log files are being built right now is busted, there should be a root element with individual events as its children. I didn't write that myself, but it'll be good to look into. Given that I never append to a file after I'm done writing to it, there's no reason why the trace writer object cannot append a closing tag for the root element in some kind of finally block, so even if the application explodes dramatically (minus the PC losing power, heh) it should still be able to keep the XML well-formatted. – Alexandr Kurilin Mar 15 '12 at 01:43
  • @glitch: actually .NET supports parsing fragments like this directly - i'll post an example – porges Mar 15 '12 at 01:48
  • @glitch, I'd check out this question: http://stackoverflow.com/questions/398378/get-last-10-lines-of-very-large-text-file-10gb-c-sharp. The question deals with lines, but the accepted answer looks like it accepts input for any delimiter/token, so it should still be relevant to your situation. Be sure to read all the pitfalls in the comments. – devuxer Mar 15 '12 at 01:51

2 Answers

2

If anyone is having problems with broken traces, I made this PowerShell script.

function process-event
{
    $dest = $args[1]
    # The whole .svclog is one line, so -AllMatches pulls every event out of that single line.
    # Matching through the closing tag keeps the final event and drops only a truncated trailing one.
    Get-ChildItem $args[0] |
        Select-String "<E2ETraceEvent.*?</E2ETraceEvent>" -AllMatches |
            ForEach-Object {
                foreach ($m in $_.Matches) {
                    Add-Content -Path $dest -Value $m.Value -Encoding UTF8 } };
}

function process-log
{
    # Wrap the extracted events in a single root element so the output is well-formed XML,
    # and write everything as UTF-8 to match the declared encoding.
    '<?xml version="1.0" encoding="utf-8"?><Tracing>' | Out-File $args[1] -Encoding UTF8
    process-event $args[0] $args[1]
    '</Tracing>' | Out-File $args[1] -Append -Encoding UTF8
}

process-log .\the_log.svclog .\the_log_fix.svclog

Updated! It's not particularly fast (I only needed it for files up to 300 MB), but it will fix them without burning all the RAM.

2

Since you have what are basically document fragments rather than normal documents, you could use the underlying XmlReader classes to process it:

// just a test string... XmlTextReader can take a Stream as first argument instead
var elements = @"<E2ETraceEvent/><E2ETraceEvent/>";

using (var reader = new XmlTextReader(elements, XmlNodeType.Element, null))
{
    while (reader.Read())
    {
        Console.WriteLine(reader.Name);
    }
}

This will read the XML file one element at a time, and won't keep the whole document in memory. Whatever you do in the read loop is specific to your use case :)
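For instance (a sketch along the lines of the follow-up comment below, with the per-event handling left as a stub), you can filter for E2ETraceEvent start elements and hand each ReadOuterXml() result to an XmlDocument:

using System;
using System.IO;
using System.Xml;

class FragmentReaderSketch
{
    static void Main(string[] args)
    {
        using (var stream = File.OpenRead(args[0]))
        using (var reader = new XmlTextReader(stream, XmlNodeType.Element, null))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "E2ETraceEvent")
                {
                    // ReadOuterXml() returns the whole element as a string and advances
                    // the reader past it, which is why Read() isn't called again here.
                    var doc = new XmlDocument();
                    doc.LoadXml(reader.ReadOuterXml());

                    // ...pull the timestamp (or whatever else you need) out of doc...
                    Console.WriteLine(doc.DocumentElement.Name);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}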

porges
  • Gotcha, this sounds like what I was looking for. I'll test that out and confirm if it works as intended! Thanks! – Alexandr Kurilin Mar 15 '12 at 04:41
  • Awesome, that worked!! If anybody's curious, for each read I had to ensure that the name was E2ETraceEvent and that the nodetype was Element (as opposed to EndElement), and then I could just pass the content of ReadOuterXml() to an XmlDocument.Load(). – Alexandr Kurilin Mar 15 '12 at 18:14