
I have several log files that I need to parse and combine based on a timestamp. They're of the format:

GaRbAgE fIrSt LiNe
[1124 0905 134242422       ] Logs initialized
[1124 0905 134242568 SYSTEM] Good log entry:
{ Collection:
  ["Attribute"|String]
...
[1124 0905 135212932 SYSTEM] Good log entry:

As you can see I don't need the first line.
I'm currently using regular expressions to parse each file. One expression detects a "Logs initialized" line, which I don't care about and discard. Another detects a "Good log entry", which I keep and parse. Some good log entries span multiple lines, so I simply accept the lines that follow one as part of it. However, the code currently also captures the garbage first line, because to a regex it is indistinguishable from a continuation line of a multi-line entry. Furthermore, from what I have read, regex may not be the right tool here (Parsing a log file with regular expressions).
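One way to tell the garbage line apart from a continuation line is to track state: a continuation line only counts if a header line has already been seen. The sketch below illustrates that idea. The `Header` pattern is a guess inferred from the three sample lines above (two 4-digit fields, a 9-digit field, an optional source name), and `LogEntryGrouper` and `Group` are illustrative names, so treat it as a starting point rather than a parser for the real format.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogEntryGrouper
{
    // Hypothetical header pattern, inferred from samples such as
    // "[1124 0905 134242568 SYSTEM] Good log entry:".
    private static readonly Regex Header =
        new Regex(@"^\[(\d{4}) (\d{4}) (\d{9})\s+(\w*)\s*\]\s?(.*)$");

    // Groups lines into entries: each entry is a header line followed by
    // its continuation lines. Lines before the first header are dropped.
    public static IEnumerable<List<string>> Group(IEnumerable<string> lines)
    {
        List<string> current = null;
        foreach (string line in lines)
        {
            if (Header.IsMatch(line))
            {
                if (current != null)
                {
                    yield return current; // previous entry is complete
                }
                current = new List<string> { line };
            }
            else if (current != null)
            {
                current.Add(line); // continuation of the current entry
            }
            // Otherwise no header has been seen yet: leading garbage, ignored.
        }
        if (current != null)
        {
            yield return current; // last entry in the input
        }
    }
}
```

With this shape the first line needs no special case, and "Logs initialized" entries can be filtered out afterwards by inspecting the first line of each group.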

There are many log files and they can grow to be rather large. For this reason, I'm only reading 50 lines at a time per log before buffering and then combining them into a separate file. I loop through every file as long as there are non-null files left. Below is a code example where I replaced some conditions and variables with explanations.

while (there are non-null files left to read)
{
    foreach (object logFile in logFiles) // logFiles is an array that stores the log names
    {
        int numLinesRead = 0;
        using (StreamReader fileReader = File.OpenText(logFile.ToString()))
        {
            string fileLine;
            // read in a line from the file
            while ((fileLine = fileReader.ReadLine()) != null && numLinesRead < 50)
            {
                // compare line to regex expressions
                Match rMatch = rExp.Match(fileLine);
                if (rMatch.Success)  // found good log entry
                {
                    ...

How would you skip that first garbage line? Unfortunately, it is not as simple as consuming one line with ReadLine(): the StreamReader is created inside a loop, so I would end up dropping one line out of every 50.
I thought of keeping a list or array of the files whose first line I've already skipped (so that I don't skip it more than once), but that is rather ugly. I also thought of removing the using statement and opening the StreamReader before the loop, but I'd prefer not to do that.

EDIT: After posting, I realized that my implementation might not be correct at all. When the StreamReader is closed and disposed, I believe my previous position in the file is lost. In that case, should I still use StreamReader without the using construct, or is there a different type of file reader I should consider?
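The concern in the edit is justified: a reader that is disposed and reopened starts again from the top of the file. A minimal sketch of the "no using inside the loop" option follows, assuming it is acceptable to hold one open handle per log file at the same time. The names `RoundRobinLogReader`, `ReadInBatches`, `batchSize`, and `onLine` are illustrative and not from the question. Each reader is opened once, the first line is discarded once at that point, and a `try`/`finally` stands in for the `using` block.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RoundRobinLogReader
{
    // Visits the files in turn, reading up to batchSize lines from each per
    // visit, and passes (path, line) to onLine. Readers stay open between
    // visits, so each file continues where it left off.
    public static void ReadInBatches(
        IEnumerable<string> logFiles, int batchSize, Action<string, string> onLine)
    {
        var open = new Queue<KeyValuePair<string, StreamReader>>();
        try
        {
            foreach (string path in logFiles)
            {
                StreamReader reader = File.OpenText(path);
                open.Enqueue(new KeyValuePair<string, StreamReader>(path, reader));
                reader.ReadLine(); // discard the garbage first line, once per file
            }

            while (open.Count > 0)
            {
                // Peek first so the reader is still queued (and gets disposed
                // in the finally block) if onLine throws.
                KeyValuePair<string, StreamReader> entry = open.Peek();
                string line = null;
                int count = 0;
                // Check the count first so no extra line is read and lost.
                while (count < batchSize && (line = entry.Value.ReadLine()) != null)
                {
                    count++;
                    onLine(entry.Key, line);
                }

                open.Dequeue();
                if (line == null)
                {
                    entry.Value.Dispose(); // end of file: this log is done
                }
                else
                {
                    open.Enqueue(entry);   // more to read on a later visit
                }
            }
        }
        finally
        {
            foreach (KeyValuePair<string, StreamReader> entry in open)
            {
                entry.Value.Dispose();
            }
        }
    }
}
```

A call might look like `RoundRobinLogReader.ReadInBatches(paths, 50, (path, line) => Process(line))`, where `Process` is whatever per-line handling applies. With a very large number of logs, the operating system's limit on open file handles could make this approach unsuitable.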

valsidalv
  • You're really adding `"Good log entry"` to every valid line in your log-file? Well, disk space is cheap nowadays. – Tim Schmelter Sep 10 '13 at 20:52
  • I have absolutely no control over how these logs are output. But no, it doesn't actually say "Good log entry" - there is a more specific message :) – valsidalv Sep 10 '13 at 20:53
  • possible duplicate of [Reading large text files with streams in C#](http://stackoverflow.com/questions/2161895/reading-large-text-files-with-streams-in-c-sharp) – User Sep 10 '13 at 20:57
  • @Peter I can definitely use some of those answers to speed up my reading but I wouldn't consider it a duplicate as there's no loop. – valsidalv Sep 10 '13 at 21:08
  • The highest rated answer seemed like it was exactly what you are looking for. That's why I marked it as a possible duplicate. – User Sep 11 '13 at 02:04

2 Answers


You could use `File.ReadLines` together with `Skip(1)`.

Instead of this:

using (StreamReader fileReader = File.OpenText(logFile.ToString()))
{
    string fileLine;
    // read in a line from the file
    while ((fileLine = fileReader.ReadLine()) != null && numLinesRead < 50)
    {

do this:

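// requires: using System.IO; using System.Linq;
int numLinesRead = 0;

// File.ReadLines streams the file lazily; Skip(1) drops only its first line
foreach (var fileLine in File.ReadLines(logFile.ToString()).Skip(1))
{
    if (++numLinesRead > 50) // stop once 50 lines have been processed
        break;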
Matthew Watson
  • in the latter case - this is not a good general solution (but might be ok for small files) as reading all the file into memory to skip one line is wasteful imho. – Preet Sangha Sep 10 '13 at 20:59
  • @PreetSangha It's *not* doing that. `ReadLines` will stream the file's data, just like the previous solution is; it's just not using a terrible API to do it. Had he used `ReadAllLines` instead, then yes, that would be bad, and I'd yell at him until he changed it, but he didn't. – Servy Sep 10 '13 at 21:02
  • Doesn't the `Skip(1)` statement get hit every time a log is read (the foreach is inside a while loop)? It will skip a real log line after the first iteration - correct me if I'm wrong. – valsidalv Sep 11 '13 at 14:40
  • @valsidalv It only skips the very first line of each log file, as you wanted. – Matthew Watson Sep 11 '13 at 15:27
  • Ok, I understand. I will still need a method of remembering my previous position in the log file for when I return to it as @Tony Hopkinson described in his answer. Thank you! – valsidalv Sep 11 '13 at 16:12
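The exchange above turns on two facts: `Skip(1)` drops only the first line of a file, but every new `foreach` over `File.ReadLines` starts again from the top of that file. One possible way to resume, sketched here on the assumption that the caller keeps a per-file count of lines already processed, is to skip that count as well. `BatchReader`, `ReadBatch`, `linesAlreadyRead`, and `batchSize` are illustrative names, not part of the answer.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class BatchReader
{
    // Returns up to batchSize lines that follow the garbage first line and
    // the linesAlreadyRead lines handled by earlier calls. An empty list
    // means the file is exhausted.
    public static List<string> ReadBatch(string path, int linesAlreadyRead, int batchSize)
    {
        return File.ReadLines(path)
                   .Skip(1 + linesAlreadyRead) // garbage line plus earlier batches
                   .Take(batchSize)
                   .ToList();                  // materialize before the file is closed
    }
}
```

The caller adds the size of each returned batch to its running count for that file. Every call still reads the skipped lines from disk before discarding them, so the total work grows quadratically with file length; for large logs, keeping a reader open or seeking to a saved position is likely the better fit.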

Add another parameter to the method for the position in the file. On the first call it is zero, so you can consume the garbage line before you go into the loop. On later calls, use it to position the stream where the previous call left off.

e.g.

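// requires: using System.Collections.Generic; using System.IO; using System.Text;
long position = 0;
while (position >= 0)
{
    position = ReadFiftyLines(argLogFile, position); // resume where the last call stopped
}

// Returns the byte offset to resume from, or -1 once the end of the file is reached.
public long ReadFiftyLines(string argLogFile, long argPosition)
{
    using (FileStream fs = new FileStream(argLogFile, FileMode.Open, FileAccess.Read))
    {
        fs.Seek(argPosition, SeekOrigin.Begin);
        if (argPosition == 0 && ReadLineFrom(fs) == null)
        {
            return -1; // empty file; otherwise the first (garbage) line is now consumed
        }
        string line = null;
        int count = 0;
        // test the count first so a 51st line is never read and then dropped
        while ((count < 50) && ((line = ReadLineFrom(fs)) != null))
        {
            count++;
            // do stuff with line
        }
        if (line == null)
        {
            return -1; // end of file
        }
        return fs.Position;
    }
}

// A StreamReader reads ahead in blocks, so fs.Position would overshoot the last
// line it returned. Reading bytes directly keeps fs.Position exact.
// Assumes ASCII or UTF-8 logs with \n or \r\n line endings.
private static string ReadLineFrom(Stream stream)
{
    var bytes = new List<byte>();
    int b;
    while ((b = stream.ReadByte()) != -1 && b != '\n')
    {
        bytes.Add((byte)b);
    }
    if (b == -1 && bytes.Count == 0)
    {
        return null; // nothing left to read
    }
    return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
}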

or somesuch.

Tony Hopkinson
  • This will mean that, in order to position the stream where I left off, I'd have to run a for-loop with many ReadLine()s as there's no skip for StreamReader. Certainly a solution but this may be slow when I have thousand-line log files. – valsidalv Sep 10 '13 at 21:04
  • Eh? Create the FileStream, position it with seek, create the StreamReader, it starts from the current position. StreamReader also has a property called BaseStream so you could seek on that, if you are only passing the StreamReader instance into the method. – Tony Hopkinson Sep 10 '13 at 21:45
  • Thanks for the clarification, my C# knowledge is pretty basic. – valsidalv Sep 11 '13 at 14:15