I'm trying to determine the most efficient way of parsing .svclog files. For context, the .svclog files I'm dealing with look like the ones described in http://msdn.microsoft.com/en-us/library/aa751795.aspx. The tracing logic creates <E2ETraceEvent/>
elements and puts them all on a single line in the .svclog file, so you end up with tens of megabytes' worth of single-line XML, such as:
<E2ETraceEvent [...]</E2ETraceEvent><E2ETraceEvent [...] </E2ETraceEvent>...
What is the most efficient way of reading one <E2ETraceEvent/>
element at a time from this giant line? I know there are tools out there that can indent the XML for you and save the result either into the same file or into a separate one. That's an additional step I would rather skip, since performance will matter given the number of these files I might have to process. I don't want to have to indent a hundred fat files before I can even start processing them.
I could load the entire file into memory and treat it as a string (they're capped at 30 MB in my case), but I'm envisioning implementing some kind of "log-merging" logic in the future, where I might need to stitch together hundreds of these files, so loading them all into memory at once is just not going to happen.
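For what it's worth, the kind of streaming I have in mind looks roughly like this (a Python sketch for prototyping — `iter_trace_events` and the chunk size are my own inventions; a .NET `XmlReader` with `ConformanceLevel.Fragment` would be the analogous approach):

```python
import xml.etree.ElementTree as ET

def iter_trace_events(path, chunk_size=64 * 1024):
    """Yield one <E2ETraceEvent> element at a time without loading the file."""
    parser = ET.XMLPullParser(events=("end",))
    # The file is an XML fragment (many top-level elements),
    # so wrap it in a synthetic root to make it a valid document.
    parser.feed(b"<root>")
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            parser.feed(chunk)
            for _, elem in parser.read_events():
                # Real events are namespaced ({...}E2ETraceEvent), hence endswith.
                if elem.tag.endswith("E2ETraceEvent"):
                    yield elem
                    # Process each element before advancing the generator;
                    # clearing it keeps memory flat across the whole file.
                    elem.clear()
    parser.feed(b"</root>")
```

The cleared elements still leave empty stubs attached to the synthetic root, but those are tiny compared to the payloads, so memory stays roughly constant.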
I could probably use a regex such as "<E2ETraceEvent.*?</E2ETraceEvent>" and advance one element at a time (is that efficient at all?).
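If I went the regex route, I imagine memory-mapping the file would avoid pulling it all into a string at once — something like this sketch (the function name is mine):

```python
import mmap
import re

# Bytes pattern so it can run directly over a memory-mapped file;
# DOTALL in case an element ever contains a stray newline.
EVENT_RE = re.compile(rb"<E2ETraceEvent.*?</E2ETraceEvent>", re.DOTALL)

def iter_events_regex(path):
    """Lazily yield each raw <E2ETraceEvent>...</E2ETraceEvent> slice."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in EVENT_RE.finditer(mm):
                yield match.group(0)
```

`finditer` is lazy, so the OS pages the file in as the scan advances, and the non-greedy `.*?` is what keeps one match from swallowing all the elements that follow it.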
I could manually implement a state machine that reads one character at a time. That already sounds bad :)
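Then again, it wouldn't have to be character-at-a-time — a chunked splitter that just searches a buffer for the closing tag is about all the state it needs. A sketch of what I mean (assumes events sit back-to-back with at most whitespace between them; the names are mine):

```python
CLOSING_TAG = b"</E2ETraceEvent>"

def iter_events_buffered(path, chunk_size=64 * 1024):
    """Read fixed-size chunks and carve out one raw event per closing tag."""
    buf = b""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            buf += chunk
            # Emit every complete element currently in the buffer.
            while (end := buf.find(CLOSING_TAG)) != -1:
                cut = end + len(CLOSING_TAG)
                yield buf[:cut].lstrip()  # drop any inter-element whitespace
                buf = buf[cut:]
```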
Tons of options, but I'm looking for something truly clean and elegant.
PS. Is it really common to deal with single-line files when parsing? I haven't done much parsing work before, but almost all of the tools I've worked with seem to rely on reading some number of lines at a time. All of that becomes completely useless the instant there isn't a single newline in the entire file.