I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).
The lines can be of different lengths.
In order to reduce GC pressure and avoid creating unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line in one pass and store its position, thereby producing an index of lineNo => position, i.e.:
// maps each line number to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
- go through the entire file
- when a new line is detected, increment lineCount and add the fileStream.Position to the _lineNumberToFileStreamPositionMapping (roughly as sketched below)
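To illustrate, the indexing pass is roughly equivalent to the sketch below. It is simplified: it assumes '\n' is a single byte (ASCII/UTF-8 text), the method name and 64 KB buffer size are just placeholders for this example, and it uses long offsets because byte positions in a 3+ GB file do not fit in an int.

using System.Collections.Generic;
using System.IO;

List<long> BuildLineIndex(string path)
{
    var index = new List<long> { 0 };          // line 0 starts at offset 0
    var buffer = new byte[64 * 1024];

    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        long position = 0;
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                if (buffer[i] == (byte)'\n')
                {
                    index.Add(position + i + 1);   // start of the next line
                }
            }
            position += bytesRead;
        }
    }
    return index;
}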
We then use an API similar to:
public void ReadLine(int lineNumber)
{
    var streamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
    // ... set the stream position, read the byte array, convert to string, etc.
}
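The elided part is roughly equivalent to the following sketch (shown here returning the string for clarity). It assumes UTF-8 text with '\n' line terminators, a FileStream field named _fileStream that is kept open between reads, and Encoding.UTF8 from System.Text; it also reads byte-by-byte for brevity rather than through a buffer.

public string ReadLine(int lineNumber)
{
    // Sketch: seek to the recorded offset, read bytes up to the next '\n',
    // then decode them into a string.
    long streamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
    _fileStream.Seek(streamPosition, SeekOrigin.Begin);

    var lineBytes = new List<byte>();
    int b;
    while ((b = _fileStream.ReadByte()) != -1 && b != '\n')
    {
        if (b != '\r')                      // ignore the CR of CRLF endings
        {
            lineBytes.Add((byte)b);
        }
    }
    return Encoding.UTF8.GetString(lineBytes.ToArray());
}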
This solution currently provides good performance; however, there are two things I do not like:
- Since I do not know the total number of lines in the file, I cannot preallocate an array; therefore I have to use a List<int>, which may resize to up to double the capacity I actually need.
- Memory usage: as an example, for a text file of ~1 GB with ~5 million lines of text, the index occupies ~150 MB. I would really like to decrease this as much as possible.
Any ideas are very much appreciated.