I have a large 20 GB text file with entries resembling entry1MainText:entry1Name, line separated.

I need to see if a property of an object matches entry1MainText in any of these lines. So far I have the below code (ref Reading large text files with streams in C#) that reads a line of the file and performs a foreach for said object property. I realise this is likely not the most efficient way.

string file = @"C:\test.txt";

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        foreach (UsrFile usrF in rawUsrSorted)
        {
            if (line.Contains(usrF.Prop1))
            {
                gridMain.Rows.Add(usrF.Prop1, usrF.Prop2);
            }
        }
    }

}

I do have the benefit of having enough RAM to read the file into memory and parsing there if this would be of benefit, i.e. I have looked a little into MemoryMappedFile and wonder if this might be of use here.

shearlynot
  • I would think about some kind of indexing first, but that depends on how the data is written to the file (also assuming the file is static). – Mario Vernari Feb 22 '21 at 09:03
  • Did you try this one https://stackoverflow.com/a/25936140/5555803? – Orkhan Alikhanov Feb 22 '21 at 09:04
  • @MarioVernari yes the file is static, nice idea. – shearlynot Feb 22 '21 at 09:05
  • There's an overload of `FileStream` that accepts a buffer size - using `BufferedStream` instead of that will actually make it slower. – Matthew Watson Feb 22 '21 at 09:06
  • What is `gridMain` by the way? If it's a `DataGridView` it is likely updating itself whenever you add a row to it. It would be much more efficient to use `.AddRange()` to update it in a oner. – Matthew Watson Feb 22 '21 at 09:12
  • I see you're using `Contains` to check if the line matches - but from the description it seems like you want to check `StartsWith(usrF.Prop1 + ":")`. Do you actually need to check if the text _contains_ the property, or do you want to check if they are equal? – gnud Feb 22 '21 at 09:15
  • If the file is static then do a transformation to a data format or storage format/type where searching is quicker. Any optimizations you do to the above code will be minimal compared to a better method for doing it. – Lasse V. Karlsen Feb 22 '21 at 09:15
  • @MatthewWatson good guess, yes it is. I'll look into this, thanks. – shearlynot Feb 22 '21 at 09:23

1 Answer

  1. Instantiate a new FileStream directly to get access to the buffer size and the file options
  2. Adjust the buffer size. For an SSD this can be quite large; I have chosen 1024 * 1000 (see what works for your drive)
  3. Set the FileOptions.SequentialScan flag

Indicates that the file is to be accessed sequentially from beginning to end. The system can use this as a hint to optimize file caching. If an application moves the file pointer for random access, optimum caching may not occur; however, correct operation is still guaranteed. Specifying this flag can increase performance in some cases.

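  4. Split each line at the first ':' to get the entry1MainText part
  5. Use a Dictionary keyed on Prop1, so each line costs one lookup instead of a loop over every object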

Example

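// Build a lookup keyed on Prop1 (ToDictionary throws if two entries share a Prop1)
var dict = rawUsrSorted
    .ToDictionary(x => x.Prop1, x => x.Prop2);

using var fs = new FileStream(
    file,
    FileMode.Open,
    FileAccess.Read,
    FileShare.ReadWrite,
    1024 * 1000,                  // buffer size in bytes; tune for your drive
    FileOptions.SequentialScan);  // hint that the file is read front to back

using var sr = new StreamReader(fs);

string line;
while ((line = sr.ReadLine()) != null)
{
    var idx = line.IndexOf(':');
    if (idx < 0)
        continue; // no separator on this line; the range below would throw

    var prop = line[..idx];
    if (dict.TryGetValue(prop, out var prop2))
        gridMain.Rows.Add(prop, prop2);
}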

Note: This is completely untested, may contain any number of typos, syntax errors or mistakes, and lacks suitable error checking and fault tolerance.

Also note: You should really use a database. Scanning a 20 GB file is extremely inefficient compared to an indexed table.
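
As a rough sketch of how the dictionary lookup could be separated from the grid (so that rows can be added in one go, as suggested in the comments), the matching loop can be moved into a helper that takes any TextReader. The names PrefixMatcher and FindMatches are hypothetical and not part of the question's code, and the sketch assumes that the key is everything before the first ':' on a line and that lines without a ':' can be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PrefixMatcher
{
    // Hypothetical helper: returns a (Prop1, Prop2) pair for every line whose
    // text before the first ':' is a key in the lookup.
    public static List<(string Prop1, string Prop2)> FindMatches(
        TextReader reader,
        IReadOnlyDictionary<string, string> lookup)
    {
        var matches = new List<(string Prop1, string Prop2)>();

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int colon = line.IndexOf(':');
            if (colon < 0)
                continue; // assumption: lines without a separator are ignored

            string key = line.Substring(0, colon);
            if (lookup.TryGetValue(key, out var prop2))
                matches.Add((key, prop2));
        }

        return matches;
    }
}
```

Passing the StreamReader from the example above as reader would scan the real file, while a StringReader is enough to try the logic on a few lines. The returned list can then be turned into grid rows after the scan has finished, which avoids one grid update per match.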

TheGeneral