
I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).

The lines can be of different lengths.

To reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store its position in a map in one pass, thereby producing an index of lineNo => position, i.e.:

// maps each line to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
  1. go through the entire file
  2. when a new line is detected, increment lineCount and add the fileStream.Position to _lineNumberToFileStreamPositionMapping
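The indexing pass above could be sketched as follows. This is a minimal sketch, not the question's actual code: the buffered-scan details are assumptions, and `long` is used for positions because offsets in a 3+ GB file exceed `int.MaxValue` (the question's `List<int>` would overflow past 2 GB).

```csharp
using System.Collections.Generic;
using System.IO;

class LineIndexer
{
    // maps each line number to the stream position of that line's first byte
    readonly List<long> _lineNumberToFileStreamPositionMapping = new List<long>();

    public void BuildIndex(Stream stream)
    {
        _lineNumberToFileStreamPositionMapping.Add(0); // line 0 starts at offset 0

        var buffer = new byte[64 * 1024];
        long offset = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // each '\n' marks the start of the next line
                if (buffer[i] == (byte)'\n')
                    _lineNumberToFileStreamPositionMapping.Add(offset + i + 1);
            }
            offset += read;
        }
    }
}
```

Reading in 64 KB blocks rather than byte-by-byte keeps the single pass fast on multi-gigabyte files.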

We then use an API similar to:

public void ReadLine(int lineNumber)
{
     var streamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
     //... set the stream position, read the byte array, convert to string etc.
}
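Filling in the elided body, the lookup might look like this. The `_fileStream` field and the UTF-8 encoding are assumptions; the question's actual implementation may differ:

```csharp
using System.Text;

public string ReadLine(int lineNumber)
{
    long start = _lineNumberToFileStreamPositionMapping[lineNumber];

    // The next line's start (or the end of the file) bounds this line's bytes.
    long end = lineNumber + 1 < _lineNumberToFileStreamPositionMapping.Count
        ? _lineNumberToFileStreamPositionMapping[lineNumber + 1]
        : _fileStream.Length;

    _fileStream.Position = start;
    var buffer = new byte[end - start];
    _fileStream.Read(buffer, 0, buffer.Length);

    // Trim the trailing newline before decoding back to a string.
    return Encoding.UTF8.GetString(buffer).TrimEnd('\r', '\n');
}
```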

This solution currently provides good performance; however, there are two things I do not like:

  1. Since I do not know the total number of lines in the file, I cannot preallocate an array, so I have to use a List<int>, which can resize to double the capacity I actually need;
  2. Memory usage: as an example, for a text file of ~1 GB with ~5 million lines of text the index occupies ~150 MB. I would really like to decrease this as much as possible.

Any ideas are very much appreciated.

MaYaN
  • Why is the index 150MB? 5 million ints is under 20MB of raw storage, so where are you getting that value? – DavidG Apr 13 '16 at 00:08
  • That's what the profiler is showing me but then again I have not dug deeper. That aside, 20mb would be the ideal scenario however in reality it could be double that due to the resizing logic of the `List` – MaYaN Apr 13 '16 at 00:13
  • Perhaps you should use a plain old array then. Not sure a `List` actually gives you anything useful here. Once you've built the list, convert it to an array and throw the list away. – DavidG Apr 13 '16 at 00:15
  • The problem is the files do actually grow while I am reading them so `List` allows me to add the new appended lines to the index. – MaYaN Apr 13 '16 at 00:15
  • Ah, I see. Well you can still use an array along with `Array.Resize(ref yourArray, newSize)`. You will have to do your own work to determine if the work to do that is worth it over the speed increase gained from arrays. My limited testing suggests arrays will be twice as fast. – DavidG Apr 13 '16 at 00:34
  • Agreed, arrays will be much faster, I will try your suggestion with the one posted by smead. – MaYaN Apr 13 '16 at 06:37
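For reference, the `Array.Resize` approach discussed in the comments might be sketched like this (the growth chunk of 4096 entries is an arbitrary assumption, as are the names):

```csharp
using System;

// Grow a plain array in fixed-size chunks instead of relying on List<T>'s
// doubling, trading a few more resize calls for a bounded memory overshoot.
long[] _positions = new long[4096];
int _lineCount;

void AddPosition(long position)
{
    if (_lineCount == _positions.Length)
        Array.Resize(ref _positions, _positions.Length + 4096);
    _positions[_lineCount++] = position;
}
```

This also handles the growing-file case from the comments, since appended lines can keep being added to the same array.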

1 Answer

  1. Set List.Capacity manually to grow the list in fixed increments, perhaps every 1000 lines or so, rather than letting it double.

  2. If you want to trade performance for memory, you can do this: instead of storing the positions of every line, store only the positions of every 100th (or something) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
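The second suggestion might be sketched as follows. The interval of 100, the field names, and the UTF-8 encoding are all assumptions:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

const int Interval = 100;
List<long> _sparseIndex;  // position of every 100th line, recorded during the indexing pass
FileStream _fileStream;

public string ReadLine(int lineNumber)
{
    // Jump to the nearest indexed line at or before the requested one...
    _fileStream.Position = _sparseIndex[lineNumber / Interval];

    using (var reader = new StreamReader(_fileStream, Encoding.UTF8,
        detectEncodingFromByteOrderMarks: false, bufferSize: 4096, leaveOpen: true))
    {
        // ...then count forward the remaining (at most Interval - 1) lines.
        for (int skip = lineNumber % Interval; skip > 0; skip--)
            reader.ReadLine();
        return reader.ReadLine();
    }
}
```

With an interval of 100, the index shrinks by roughly a factor of 100 (~1.5 MB instead of ~150 MB in the question's example) at the cost of reading up to 99 extra lines per lookup.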

smead