I'm not asking about merely reading a large file, or about reading/writing an XML file; I know there are XML-related classes for handling that. Let me give a more specific description of what I'm trying to do:

I have a very large file, about 10 TB, which I cannot load into memory at once. That means I could not do the following:

        var lines = File.ReadAllLines("LargeFile.txt"); // impossible: 10 TB won't fit in memory
        var t = 1L << 40; // 1L forces a long shift; a plain int 1 << 40 silently wraps around
        for (var i = t; i < 2 * t; i++)
        {
            lines[i] = someWork(); // someWork() stands in for the per-line update
        }

        File.WriteAllLines("LargeFile.txt", lines);

I want to read and update the lines that lie in the range between the 1 TB and 2 TB marks.

What's the best approach to doing this? Examples of .NET classes or third-party libraries would be helpful. I'm also interested in how other languages handle this problem.


I tried David's suggestion of using Position. However, I feel it doesn't work:

1. The size of the FileStream seems fixed: I can modify bytes, but the write just overwrites byte by byte. If my new data is larger or smaller than the original line, I won't be able to update the line correctly.
2. I didn't find an O(1) way to convert a line number to a byte position; it still takes me O(n) to find the position.

Below is my attempt:

    public static void ReadWrite()
    {
        var fn = "LargeFile.txt";
        File.WriteAllLines(fn, Enumerable.Range(1, 20).Select(x => x.ToString()));

        var targetLine = 11; // zero-based
        long pos = targetLine == 0 ? 0 : -1; // line 0 starts at offset 0
        using (var fs = new FileStream(fn, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            // This still takes O(n) on average to scan the whole file for the position;
            // I'm not sure if there is a better way to jump to line x in O(1) time.
            while (pos < 0 && fs.Position != fs.Length)
            {
                // each '\n' read finishes one line; the byte right after the
                // newline that ends line (targetLine - 1) starts the target line
                if (fs.ReadByte() == '\n' && --targetLine == 0)
                {
                    pos = fs.Position;
                }
            }
        }

        using (var fs = new FileStream(fn, FileMode.Open, FileAccess.ReadWrite))
        {
            var data = Encoding.UTF8.GetBytes("999");
            fs.Position = pos;
            // if the new data has a different size than the current line,
            // this write will clobber the following lines' data
            fs.Write(data, 0, data.Length);
        }
    }
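
One way to amortize that O(n) scan is a one-time index of line-start offsets: a single pass records where every line begins, after which any line number maps to a byte position in O(1). A sketch, assuming the file doesn't change between indexing and lookups; note that for a 10 TB file of short lines the index itself outgrows memory and would have to live on disk:

    // Builds a lookup table: offsets[i] is the byte position where line i starts.
    // One O(n) pass up front, then every line-number-to-position query is O(1).
    public static List<long> BuildLineIndex(string path)
    {
        var offsets = new List<long> { 0L }; // line 0 starts at offset 0
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var bs = new BufferedStream(fs, 1 << 20)) // ReadByte on a bare FileStream is slow
        {
            long bytesRead = 0;
            int b;
            while ((b = bs.ReadByte()) != -1)
            {
                bytesRead++;
                if (b == '\n')
                {
                    offsets.Add(bytesRead); // the next line starts right after this '\n'
                }
            }
        }
        return offsets;
    }

With the index in hand, the scan in ReadWrite above collapses to fs.Position = index[targetLine]; the different-size-write problem from point 1 remains, though.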
Huan Jiang
  • Open two streams, one for reading and one for writing. Read in chunks and write back in chunks (a sketch of this approach appears after these comments). That is probably the only helpful advice you are going to get with such a broad and nondescript question. – Igor Jun 29 '17 at 19:20
  • Just to clarify something, you say the file size is about 10T, is this "TB" as in terabyte? If so, then is 1T to line 2T also TB? Meaning you want to process from whichever line is at/around the 1TB mark and then 1TB worth of lines forward? Or is "T" in the line range simply a number, so the loop you have in your example code is correct, other than you not being able to read all the lines into memory? – Lasse V. Karlsen Jun 29 '17 at 19:38
  • My code is an example; T means TB. The file to process is at least 10 TB in size. The range length could be large or small, even just 1, but the range should be able to start from any line of the file. As I already mentioned, I can't do it with the example code. – Huan Jiang Jun 29 '17 at 19:57
  • While it is possible to change individual bytes inside large files; this is very dangerous if the changes are different sizes, in different locations, or if there are any file or hardware errors while you are making the change. Any of those things can corrupt your entire file. In general, the best way is to make the changes in a duplicate file and, if it works, swap the file locations. [Safe Stream update of file](https://stackoverflow.com/a/327033/22437) shows an example. – Dour High Arch Jun 29 '17 at 20:32
  • Dour, you bring up a good point; it is very easy to corrupt the file. The read-and-write-line-by-line approach works; however, it is too expensive if I only need to update a few lines of data in a large file. – Huan Jiang Jun 29 '17 at 21:21
  • Voting to close as “too broad”. Updating multi-TB files is a very difficult problem. Providing a good solution will require much more information about your data and usage, and is likely beyond the scope of a good SO answer. Flat files and streams are about the worst way to handle this. Consider using a database; they are designed to update large amounts of data correctly. Which “database” would be best depends on too much you have not told us. – Dour High Arch Jun 29 '17 at 23:50
  • OK, I'll leave it open for a couple of days; if there are no more suggestions, I'll close this question. – Huan Jiang Jun 30 '17 at 00:12
  • It says the question has been put on hold; I'm not sure how to close it, so I have picked the only answer as the solution. Thanks, everyone, for the suggestions. – Huan Jiang Jun 30 '17 at 15:58
  • If you want to close a question you shouldn't accept an answer. Also, the answer you accepted won't allow you to update or change the file. – Dour High Arch Jun 30 '17 at 16:58
  • I undid the accept; however, I still cannot find a place to close it. I can only vote to delete. – Huan Jiang Jul 01 '17 at 01:03
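
For reference, a sketch of the copy-and-swap approach from Igor's and Dour High Arch's comments: stream the file line by line, rewrite only the lines in the target range into a temporary file, then swap it into place with File.Replace. The method name, range parameters, and transform delegate here are placeholders; the approach needs roughly the original file's size in free space on the same volume, and it still reads the whole 10 TB once, but a failure partway through cannot corrupt the source file:

    // Hypothetical helper: rewrites lines [firstLine, lastLine] (zero-based)
    // via the caller-supplied transform; everything else is copied unchanged.
    public static void RewriteRange(string path, long firstLine, long lastLine,
                                    Func<string, string> transform)
    {
        var tmp = path + ".tmp";
        using (var reader = new StreamReader(path))
        using (var writer = new StreamWriter(tmp))
        {
            string line;
            long lineNo = 0;
            while ((line = reader.ReadLine()) != null)
            {
                var inRange = lineNo >= firstLine && lineNo <= lastLine;
                writer.WriteLine(inRange ? transform(line) : line);
                lineNo++;
            }
        }

        // Swap only after the rewrite has fully succeeded; the original is
        // kept as a .bak so an error partway through leaves it untouched.
        File.Replace(tmp, path, path + ".bak");
    }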

1 Answer

You don't have to read through the first 1 TB to modify the middle of the file. FileStream supports random access, e.g.:

    string fn = @"c:\temp\huge.dat";
    using (var fs = new FileStream(fn, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        // seek directly to an absolute byte offset (here 1 GB in);
        // no need to read any of the preceding bytes first
        fs.Position = 1024L * 1024L * 1024L;
        //. . .
    }

Once you reposition the FileStream, you can read and write at the current location, or open a StreamReader over it to read text from the file. You must, of course, ensure that you move to a byte offset that begins a character in the file's encoding.
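
A sketch of how that might look in practice; the re-alignment step is an addition, and it assumes ASCII or UTF-8 text, where the byte 0x0A can only ever be a line feed and never part of a multi-byte character. Seek near the desired offset, skip forward to the next newline so reading starts on a line boundary, then hand the stream to a StreamReader:

    string fn = @"c:\temp\huge.dat";
    using (var fs = new FileStream(fn, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fs.Position = 1024L * 1024L * 1024L * 1024L; // jump near the 1 TB mark; this lands mid-line

        // consume bytes until the next '\n' so the reader below
        // starts at the beginning of a whole line
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n') { }

        using (var reader = new StreamReader(fs, Encoding.UTF8))
        {
            var firstWholeLine = reader.ReadLine(); // first complete line past the 1 TB mark
            // . . . process lines from here . . .
        }
    }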

David Browne - Microsoft