
I have a big text file (>2 GB). I am currently reading the file in 1 KB chunks using a FileStream. In each chunk I count the number of lines, and from this count I have found the byte position in the file of the line that has to be deleted.

For example, if the byte position of the line I want to delete is 4097, is there a way in C# to delete the characters starting at 4097 until I hit the \n character?

I was looking at the FileStream.Seek() method to jump directly to the delete position, but I am not sure how to proceed from there.

Since it is a big file, I do not want to create another file, which would consume a lot of disk space and also memory. Is there an efficient way to delete the line without creating a new file?

Any suggestions and help would be appreciated.

Thanks in advance!

bedinesh
  • possible duplicate of [Efficient way to delete a line from a text file](http://stackoverflow.com/questions/532217/efficient-way-to-delete-a-line-from-a-text-file) and http://stackoverflow.com/questions/668907/how-to-delete-a-line-from-a-text-file-in-c – Darin Dimitrov Mar 03 '13 at 22:48
  • not really if you are deleting a line you will ultimately need to write another file or "move" the subsequent lines to earlier in your file and then trim the end. – Paul Farry Mar 03 '13 at 23:01
  • you could logically delete the lines, then when you insert a new line just replace the first line that you marked for deletion, or if none are found place it at the end. – Steve's a D Mar 03 '13 at 23:40
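The "move the subsequent lines earlier and then trim the end" approach from the comments could be sketched like this. This is a minimal sketch, not from the original post: the method name, the 64 KB chunk size, and the single-shared-stream design are illustrative choices. No second file is created, but every byte after the deleted line is still rewritten once.

```csharp
using System;
using System.IO;

class InPlaceDelete
{
    // Shifts everything after the deleted line back over it,
    // then truncates the file with SetLength.
    public static void DeleteLineInPlace(string path, long lineStart)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Find the first byte after the line (just past its '\n').
            fs.Seek(lineStart, SeekOrigin.Begin);
            int c;
            while ((c = fs.ReadByte()) != -1 && c != '\n') { }

            long readPos = fs.Position;   // source of the shift
            long writePos = lineStart;    // destination of the shift

            var buffer = new byte[64 * 1024]; // copy in 64 KB chunks
            while (true)
            {
                fs.Seek(readPos, SeekOrigin.Begin);
                int n = fs.Read(buffer, 0, buffer.Length);
                if (n == 0) break;
                fs.Seek(writePos, SeekOrigin.Begin);
                fs.Write(buffer, 0, n);
                readPos += n;
                writePos += n;
            }

            fs.SetLength(writePos); // trim the now-duplicated tail
        }
    }
}
```

Note that this rewrites the whole tail of the file, so for a 2 GB file it is only cheap when the deleted line is near the end.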

2 Answers


I feel like the only way to truly shorten a file is to copy part of it, skip a piece, and then copy the rest. If you really need to work in place, you could opt for some form of logical deletion. For instance, you could encode newlines with LF only (which is not the default on Windows, where the CR-LF pair is used instead), most likely in an 8-bit encoding such as ASCII, and do something like this:

    // Overwrites every character from toDel up to the next '\n' with '\n',
    // turning the tail of the line into a run of empty lines.
    public static void LogicalEraseLine(string filename, long toDel)
    {
        using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.ReadWrite))
        {
            fs.Seek(toDel, SeekOrigin.Begin);
            int c;

            while ((c = fs.ReadByte()) != -1)
            {
                if (c == '\n')
                {
                    break;
                }

                // ReadByte advanced the position, so step back one byte
                // and overwrite the character we just read.
                fs.Seek(-1, SeekOrigin.Current);
                fs.WriteByte((byte)'\n');
            }
        }
    }

Note that toDel is the byte offset of the first character to delete, not the index of the line to delete. This code simply replaces every character between the one at toDel and the end of the line with a newline, i.e. with an equal number of empty lines. You would then need another function that copies the file to a new file while skipping all empty lines; you could run this cleanup at any convenient time in the future. Your actual algorithm would need to cope with runs of blank lines in the file, though. Also, you are right to read the file in chunks, and the basic idea shown in this example applies in that case as well.

Edit You could use this function to erase logically deleted lines:

    // Rewrites the file without its empty lines, via a temporary file.
    public static void Cleanup(string filename)
    {
        string tmp = filename + ".tmp";

        using (FileStream input = new FileStream(filename, FileMode.Open, FileAccess.Read))
        using (FileStream output = new FileStream(tmp, FileMode.Create, FileAccess.Write))
        {
            bool emptyLine = true;
            int c;

            while ((c = input.ReadByte()) != -1)
            {
                if (c == '\n')
                {
                    // Keep the newline only if the line had content.
                    if (!emptyLine)
                    {
                        output.WriteByte((byte)c);
                        emptyLine = true;
                    }
                }
                else
                {
                    output.WriteByte((byte)c);
                    emptyLine = false;
                }
            }
        }

        File.Delete(filename);
        File.Move(tmp, filename);
    }

Also, when deleting files it is a good idea to be very careful and to double-check everything that may go wrong.

Edit The first algorithm was kinda meaningless because I was still reading the entire file; now it makes sense.

damix911
  • That was a good idea damix911. I implemented this and it works perfect! Apologies that I am currently not able to give a vote up for your answer, because my reputation is low as I just set up an account. Will vote up when I have the enough reputation. Thanks! – bedinesh Mar 04 '13 at 21:59
  • You are welcome, I'm glad it helped :-) Thank you for your question, these kind of topics on persistence are always very interesting. – damix911 Mar 05 '13 at 11:52

An efficient way to handle large files is to use memory-mapped files. The benefit is that you do not need to read the whole file, modify it, and then write it all back; you can modify just the interesting part of the data. Set 4097 as the offset and map, say, 100 KB. This example from MSDN should help you get started:

    // MyColor is a sample struct from the MSDN article; Marshal comes
    // from System.Runtime.InteropServices.
    long offset = 0x10000000;
    long length = 0x20000000; // 512 megabytes

    // Create the memory-mapped file.
    using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\ExtremelyLargeImage.data", FileMode.Open, "ImgA"))
    {
        // Create a random access view, from the 256th megabyte (the offset)
        // to the 768th megabyte (the offset plus length).
        using (var accessor = mmf.CreateViewAccessor(offset, length))
        {
            int colorSize = Marshal.SizeOf(typeof(MyColor));
            MyColor color;

            // Make changes to the view.
            for (long i = 0; i < length; i += colorSize)
            {
                accessor.Read(i, out color);
                color.Brighten(10);
                accessor.Write(i, ref color);
            }
        }
    }
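Applied to the question, a memory-mapped sketch that blanks a line in place might look like the following. Note that a memory-mapped file cannot shrink the file either; this overwrites the line with '\n' bytes (like the logical deletion in the other answer), and the method name is an illustrative choice, not an API.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfBlankLine
{
    // Overwrites the line starting at lineStart with '\n' bytes by
    // writing through a memory-mapped view. The file keeps its length;
    // a later cleanup pass can drop the resulting empty lines.
    public static void BlankLine(string path, long lineStart)
    {
        long fileLen = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor(0, fileLen))
        {
            for (long i = lineStart; i < fileLen; i++)
            {
                if (accessor.ReadByte(i) == (byte)'\n')
                    break; // reached the end of the line
                accessor.Write(i, (byte)'\n');
            }
        }
    }
}
```

In practice you would map a window around the target offset (CreateViewAccessor takes an offset and a length) rather than the whole 2 GB file.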
VladL
  • Vlad, that seems to be a good suggestion. I will sure look into it and see if I could implement that. Will vote your answer when I have more reputation. Thanks! – bedinesh Mar 04 '13 at 22:01