
I have a text-based database that represents logs, sorted by timestamp. For testing purposes my database has approximately 10,000 lines, but this number can be larger. It is of the format:

primary_key, source_file, line_num
1, cpu.txt, 2
2, ram.txt, 3
3, cpu.txt, 3

I query the database and as I read the results I want to add the actual data to a string which I can then display. Actual data in the above example would be the contents of line 2 from cpu.txt, followed by the contents of line 3 from ram.txt, etc. The line contents can be quite long.

An important note is that the line numbers per file are all in order. That is to say, the next time I encounter a cpu.txt entry in the database it will have line 4 as the line number. However, I might see a cpu.txt entry only after thousands of other entries from ram.txt, harddrive.txt, graphics.txt, etc.

I have thought about using something along the lines of the following code:

StringBuilder odbcResults = new StringBuilder();
OdbcDataReader dbReader = com.ExecuteReader();  // query database
while (dbReader.Read())
{
   string fileName = dbReader[1].ToString(); // source file
   int fileLineNum = int.Parse(dbReader[2].ToString());  // line number in source file

   odbcResults.Append(File.ReadLines(fileName).Skip(fileLineNum).First());  // note: Skip(fileLineNum) assumes 0-based line numbers; use Skip(fileLineNum - 1) if they are 1-based
}

However, won't File.ReadLines() open the file, skip lines from the beginning, and dispose of its TextReader on every iteration? That doesn't seem very efficient.

I also had this idea, keeping a StreamReader for every file that I need to read in a Dict:

Dictionary<string, StreamReader> fileReaders = new Dictionary<string, StreamReader>();
StringBuilder odbcResults = new StringBuilder();
OdbcDataReader dbReader = com.ExecuteReader();
while (dbReader.Read())
{
   string fileName = dbReader[1].ToString(); // source file
   int fileLineNum = int.Parse(dbReader[2].ToString());  // line number in source file

   if (!fileReaders.ContainsKey(fileName))
   {
      fileReaders.Add(fileName, new StreamReader(fileName));
   }

   StreamReader fileReader = fileReaders[fileName];
   // don't have to worry about positioning? Lines consumed consecutively
   odbcResults.Append(fileReader.ReadLine());
}
// can't forget to properly Close() and Dispose() of all fileReaders

Do you agree with any of the above examples or is there an even better way?
For the second example I am running on the assumption that the StreamReader will remember its last position - I believe this is saved in the BaseStream.

I have read over "How do I read a specified line in a text file?", "Read text file at specific line", and "StreamReader and seeking" (the first answer provides a link to a custom StreamReader class with positioning capabilities, but I only know the line number I need to be on, not an offset), but I don't think they answer my question specifically.

valsidalv
  • How many files are there and how large are they? I know it sounds dirty, but just loading the files into memory to start with would be *really* simple... – Jon Skeet Oct 16 '13 at 19:01
  • Simple, yes... but it's not the way I want to go about this (nor the way my boss wants me to do it). There can be more than 40 files, and these logs can span multiple days. For a large group of logs... an average of 20 MB per, possibly. Still not sure how the program will be used and abused, but I have the above code and it works aside from `odbcResults` not appending past result #2. Which is strange, since I have another StringBuilder in the same loop that builds the raw database view, and that works flawlessly. – valsidalv Oct 16 '13 at 19:42
  • Okay... just think about how cheap 1G of memory is compared with how long it will take you to implement this a more expensive way... – Jon Skeet Oct 16 '13 at 19:52
  • Oh, and as an alternative solution - could you load all the database lines into memory instead, then group them by file and process one file at a time? – Jon Skeet Oct 16 '13 at 19:53
  • @JonSkeet I actually read through all the files in an earlier step in order to merge them based on timestamp - that's how the database is generated. I need to display them in the order they show up in the database. I go from 40 separate, but related, log files to one 'master view', making it easier to read through and visualize the logs. – valsidalv Oct 16 '13 at 20:14
  • Are you going to have everything you want to display in memory at once though? Basically I'm looking for ways you could massively simplify things by not trying to stream *everything*. – Jon Skeet Oct 16 '13 at 20:16
  • @JonSkeet Oh, hmm. The user can select the earliest and latest timestamp that they want to view in order to narrow down on an issue, but there really isn't anything preventing someone from viewing all the logs at once. They'll all be displayed in textboxes within the program, if that's what you mean by "in memory at once". I'm trying to cut down on memory usage for processing by using ReadLine instead of ReadAllLines, which you mentioned earlier. – valsidalv Oct 16 '13 at 20:25
  • Okay, with that information I think I've got something which should be helpful. – Jon Skeet Oct 16 '13 at 20:31

2 Answers


If you can guarantee that your line references are strictly sequential in the file (i.e. you always ask for line n+1 after you've asked for line n), then your option of keeping a dictionary of StreamReader instances looks like a good idea.

If you might ask for line n, then line n+x (where x is some positive number >= 1), then I'd wrap that StreamReader in an object that keeps track of the current line number and has a method GetLine(int lineNo) that returns the requested line, assuming the requested line number is greater than the current line number (no reading backwards allowed).

You shouldn't have to worry about positioning. That's handled for you because you're reading sequentially.
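A minimal sketch of such a wrapper might look like this (the class and method names are just illustrative, and it assumes 1-based line numbers read strictly forward):

using System;
using System.IO;

// Illustrative wrapper: tracks how many lines have been consumed so callers
// can ask for a line by number instead of managing positions themselves.
public sealed class ForwardLineReader : IDisposable
{
    private readonly StreamReader _reader;
    private int _linesRead;   // number of lines consumed so far

    public ForwardLineReader(string path)
    {
        _reader = new StreamReader(path);
    }

    // Returns the requested (1-based) line, reading forward from the current position.
    // Returns null if the file ends before the requested line.
    public string GetLine(int lineNo)
    {
        if (lineNo <= _linesRead)
            throw new ArgumentException("Cannot read backwards.", "lineNo");

        string line = null;
        while (_linesRead < lineNo && (line = _reader.ReadLine()) != null)
        {
            _linesRead++;
        }
        return line;
    }

    public void Dispose()
    {
        _reader.Dispose();
    }
}

You'd keep one of these per file in your dictionary and call GetLine with the line number from the database row.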

Jim Mischel

It sounds like you're going to want to have in memory (for display in textboxes) everything that the user selects - so that's a natural boundary for what's feasible anyway. I suggest the following approach:

  • Read all of the matching metadata (i.e. within the user-specified time range) from the database, into a list. Keep a set of the files we'll need to read.
  • Create a new array of the same size as the list - this will hold the final data
  • Go through the required files one at a time:
    • Open the file, and remember we're at line 0
    • Iterate over the metadata list. For every entry that matches the file we currently have open, read forward to the right line, and populate the final data array element corresponding to the list entry we're looking at. We should only need to read forward, as we're still going in timestamp order.
    • Close the file

At that point, the "final data array" should be fully populated. You only need to have one file open at a time, and you never need to read the whole file. I think this is simpler than having a dictionary of open files - aside from anything else, it means you can use a using statement for each file, rather than having to handle all the closing more manually.
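A rough sketch of that approach (the LogEntry type and its FileName/LineNum properties are my own placeholders for whatever you read out of the database, and line numbers are assumed to be 1-based):

using System.Collections.Generic;
using System.IO;
using System.Linq;

class LogEntry
{
    public string FileName;   // source file from the database row
    public int LineNum;       // 1-based line number within that file
}

static class LogLoader
{
    public static string[] LoadLines(List<LogEntry> entries)
    {
        var results = new string[entries.Count];   // final data, same order as the metadata
        var files = new HashSet<string>(entries.Select(e => e.FileName));

        foreach (string file in files)
        {
            using (var reader = new StreamReader(file))   // only one file open at a time
            {
                int currentLine = 0;   // lines consumed so far from this file
                for (int i = 0; i < entries.Count; i++)
                {
                    if (entries[i].FileName != file)
                        continue;

                    string line = null;
                    while (currentLine < entries[i].LineNum)   // only ever read forward
                    {
                        line = reader.ReadLine();
                        currentLine++;
                    }
                    results[i] = line;
                }
            }
        }
        return results;
    }
}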

It does mean having all the database metadata entries in memory at once, but presumably each metadata entry is smaller than the corresponding result data, which you need to have in memory by the end anyway in order to display it to the user.

Even though you'll be going over the database metadata entries multiple times, that will all happen in memory. It should be insignificant compared with the IO to the file system or the database.

An alternative would be to group the metadata entries by filename as you read them, maintaining the index as part of the metadata entry.
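Again just a sketch, reusing the placeholder LogEntry type and the entries/results from the sketch above:

var byFile = entries
    .Select((entry, index) => new { entry, index })   // remember each entry's position in the results array
    .GroupBy(x => x.entry.FileName);

foreach (var group in byFile)
{
    using (var reader = new StreamReader(group.Key))
    {
        int currentLine = 0;
        foreach (var x in group)   // entries within a group are still in increasing line-number order
        {
            string line = null;
            while (currentLine < x.entry.LineNum)
            {
                line = reader.ReadLine();
                currentLine++;
            }
            results[x.index] = line;
        }
    }
}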

Jon Skeet
  • It seems like, because of the multiple iterations over the metadata list, there will be considerably more checks to do (metadata.Length * fileList.Length if I understand your solution correctly). I do like the idea of having `using` statements in my code though. – valsidalv Oct 16 '13 at 20:55
  • @valsidalv: That's all in memory though - really, how long would you expect that to take? Even if you have 10,000 entries and 100 files, that's checking a million in-memory items, which will occur in the blink of an eye. It will take *far* longer to fetch the data from disk and the database. I'll make that clear in my answer. – Jon Skeet Oct 16 '13 at 20:58