I am trying to modify a file stream in place, as the file has the potential to be very large and I don't want to load it into memory. The piece of information I'm editing will always be the same length, so in theory I can just swap the content out using a stream, but it doesn't seem to be writing to the correct place.

I have written code that uses a StreamReader to read line by line until it finds a regex match, and then attempts to overwrite the bytes with the edited line. The code is as follows:

private void UpdateFile(string newValue, string path, string pattern)
{
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);
    int index = 0;
    string line = "";

    using (var fileStream = File.OpenRead(path))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
    {

        while ((line = streamReader.ReadLine()) != null)
        {
            if (regex.Match(line).Success)
            {
                break;
            }
            index += Encoding.Default.GetBytes(line).Length;
        }
    }
    if (line != null)
    {
        using (Stream stream = File.Open(path, FileMode.Open))
        {
            stream.Position = index + 1;
            var newLine = regex.Replace(line, newValue);
            var oldBytes = Encoding.Default.GetBytes(line);
            var newBytes = Encoding.Default.GetBytes("\n" + newLine);
            stream.Write(newBytes, 0, newBytes.Length);
        }
    }

}

The code almost works as expected: it inserts the updated line, but always a little too early, and just how early varies slightly with the file I'm editing. I expect it has something to do with the way I am managing the stream position, but I don't know the correct way to approach this.

Unfortunately the exact files I'm working on are under NDA.

The structure is as follows, though: a file will have an unknown amount of data followed by a line of a known format, for example: Description: ABCDEF. I know the portion that follows "Description: " will always be 6 characters, so I do a replace on the line to swap in, for example, UVWXYZ. The problem is that if, for example, a file reads as

'...
UNIMPORTANT UNKNOWN DATA
DESCRIPTION: ABCDEF
MORE DATA
...'

it will come out as something like

'...
UNIMPORTANT UNKNOWN DDESCRIPTION: UVWXYZDEF
MORE DATA
...'
Joel Coehoorn
Langynom
  • It is very unclear what you are trying to do. Please give us examples of the files, which place you want to change, and what you want to change it into. – Christopher Aug 04 '19 at 22:08
  • See my edit for file structure – Langynom Aug 04 '19 at 22:48
  • It would be awesome if you could share a [mcve]. The [mcve] must be able to be copied and pasted into a console app **and run without modification** (this may involve you attaching an example file to your question; it does not have to be a real customer file, just one that demonstrates the issue). It must demonstrate the issue you have, and your question must clearly detail what the [mcve] is currently doing and what you want it to be doing instead. – mjwills Aug 04 '19 at 23:33
  • Have you considered opening the file only once (rather than twice, once for read and once for write)? https://stackoverflow.com/questions/33633344/read-and-write-to-a-file-in-the-same-stream – mjwills Aug 04 '19 at 23:50

2 Answers

1

I think the problem is that you are not accounting for the line feed ("\n") at the end of each line: ReadLine returns the line without its terminator, so your index undercounts and sets the stream position too early. Try the following code:

private void UpdateFile(string newValue, string path, string pattern)
{
   var regex = new Regex(pattern, RegexOptions.IgnoreCase);
   long index = 0; // byte offset of the current line; long, since the file may be very large
   string line = "";

   using (var fileStream = File.OpenRead(path))
   using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
   {
       while ((line = streamReader.ReadLine()) != null)
       {
           if (regex.Match(line).Success)
           {
               break;
           }
           // ReadLine strips the line terminator, so add it back when counting bytes.
           // This assumes "\n" line endings and no byte order mark; for "\r\n" files add "\r\n" instead.
           index += Encoding.Default.GetBytes(line + "\n").Length;
       }
   }
   if (line != null)
   {
       using (Stream stream = File.Open(path, FileMode.Open))
       {
           // index now points at the first byte of the matching line.
           stream.Position = index;
           var newBytes = Encoding.Default.GetBytes(regex.Replace(line, newValue));
           stream.Write(newBytes, 0, newBytes.Length);
       }
   }
}
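If the file might use "\r\n" endings, or you would rather not re-encode every line just to count bytes, one alternative is to open the file once (as suggested in the comments) and track the byte offset directly. This is only a minimal sketch, assuming an encoding in which the byte 0x0A only ever means a line feed (ASCII, UTF-8, or a single-byte code page, but not UTF-16). The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class InPlaceEdit
{
    // Hypothetical helper: overwrites the first line matching `pattern` in place.
    // Returns false if no line matched. Assumes byte 0x0A always means "line feed".
    public static bool ReplaceLineInPlace(string path, string pattern, string newValue, Encoding encoding)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            var lineBytes = new MemoryStream(); // raw bytes of the current line
            long lineStart = 0;                 // byte offset where the current line begins
            while (true)
            {
                int b = stream.ReadByte();      // FileStream buffers internally
                if (b != '\n' && b != -1)
                {
                    lineBytes.WriteByte((byte)b);
                    continue;
                }

                // Reached the end of a line (or of the file): decode and test it.
                byte[] raw = lineBytes.ToArray();
                int len = raw.Length;
                if (len > 0 && raw[len - 1] == '\r') len--; // leave a "\r\n" terminator untouched
                string line = encoding.GetString(raw, 0, len);

                if (regex.IsMatch(line))
                {
                    byte[] newBytes = encoding.GetBytes(regex.Replace(line, newValue));
                    if (newBytes.Length != len)
                        throw new InvalidOperationException("Replacement changes the line's byte length.");
                    stream.Position = lineStart; // exact offset, no arithmetic on decoded strings
                    stream.Write(newBytes, 0, newBytes.Length);
                    return true;
                }

                if (b == -1) return false;       // end of file, nothing matched
                lineStart = stream.Position;     // next line starts right after the '\n'
                lineBytes.SetLength(0);
            }
        }
    }
}
```

The length check is there because an in-place overwrite is only safe when the new line occupies exactly as many bytes as the old one. A call might look like `InPlaceEdit.ReplaceLineInPlace(path, @"(?<=description: )\w{6}", "UVWXYZ", Encoding.ASCII)`.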
OliB
  • This is perfect! Thank you, I hadn't realised that the ReadLine function didn't also return the newline character. – Langynom Aug 05 '19 at 21:25
0

In your example, you are "off" by 4 characters. Not quite the common off-by-one error, but close. But maybe a different pattern would help the most?

Programs nowadays rarely work "on the file" like that. There is just too much that can go wrong, all the way up to a power loss mid-process. Instead they:

  • create an empty new file at the same location, often with a temporary name and hidden
  • write the output to the new file
  • once you are done and everything is good - all the caches are flushed and everything is on the disk (done by Stream.Close() or Dispose()) - replace the old file with the new file using the OS move operation

The advantage is that you cannot lose the original data. Even if the computer loses power mid-operation, at worst the temporary file is messed up. You still have the original file, and you can just delete the temporary file and restart the work from scratch if you need to. Indeed, recovering the partial output only makes sense in rare cases (word processors).

The old file is replaced by the new file with a move operation. If they are on the same partition, that is literally just a rename operation in the file system. And since modern file systems are designed much like robust, top-of-the-line relational databases, there is no real danger in this.

You can find that pattern in everything from your word processor of choice to backup programs, the download manager of Firefox (as you might be overwriting a file that was there before) and even zipping programs. Every time you have a long writing phase and want to minimize the danger, it is the go-to pattern.

And because you process each line in memory and never have to reposition the stream, it gets around your issue too.

Edit: I made some source code for it from memory/documentation. Might contain syntax errors

string sourcepath; // contains the source file path, set by other code
string temppath;   // contains the path of the temp file; should be in the same folder, and thus on the same partition

// Open all streams; the usings can be stacked.
// FileOptions.WriteThrough skips intermediate caching on the output; it is optional and will hurt performance.
using (var reader = new StreamReader(sourcepath, Encoding.Default))
using (var outStream = File.Create(temppath, 4096, FileOptions.WriteThrough))
using (var writer = new StreamWriter(outStream, Encoding.Default))
{
    string line;

    // iterate over the input
    while ((line = reader.ReadLine()) != null)
    {
        // do processing on line here

        // ReadLine strips the line terminator; WriteLine adds Environment.NewLine back
        writer.WriteLine(line);
    }
}

// Replace the original with the temp file. File.Move(temppath, sourcepath) throws if the
// destination already exists, so use File.Replace; null means "keep no backup copy".
File.Replace(temppath, sourcepath, null);
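Applied to the question's case, the pattern could look roughly like the sketch below. It is not a drop-in solution: the `.tmp` naming scheme and the class and method names are assumptions for illustration, it replaces the pattern on every matching line rather than only the first, and ReadLine/WriteLine normalise line endings to Environment.NewLine.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class TempFileEdit
{
    // Hypothetical helper: streams `path` into a temp file next to it, applying the
    // regex replacement line by line, then swaps the temp file in for the original.
    public static void UpdateFileViaTemp(string path, string pattern, string newValue, Encoding encoding)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);

        // Same directory as the source, so the final swap stays on one volume.
        string tempPath = path + ".tmp"; // assumed naming scheme

        try
        {
            using (var reader = new StreamReader(path, encoding))
            using (var writer = new StreamWriter(tempPath, false, encoding))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Lines that do not match pass through unchanged.
                    writer.WriteLine(regex.Replace(line, newValue));
                }
            }

            // Only reached once the temp file is complete and closed.
            File.Replace(tempPath, path, null);
        }
        catch
        {
            // The original is still intact; discard the partial temp file.
            if (File.Exists(tempPath)) File.Delete(tempPath);
            throw;
        }
    }
}
```

Because only one line is held in memory at a time, this still works for very large files; the cost is temporary disk space for a second copy.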

Christopher