
I want to read a CSV file which can be hundreds of GBs or even TBs in size. I have a limitation that I can only read the file in chunks of 32MB. My solution to the problem works, but it's slow, and I wanted to ask if you know of a better one:

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32MB chunks at a time
    {
        // wrap only the bytes actually read, so a short final chunk
        // doesn't pick up stale bytes from the previous iteration
        var stream = new StreamReader(new MemoryStream(buffer, 0, bytesRead));
        while ((line = stream.ReadLine()) != null)
        {
            //process line
        }

    }
}

EDIT: I'm adding a restriction saying that I cannot read the file line by line.

  • Have you tried `File.ReadLines` performance? – Konrad Kokosa Jan 13 '14 at 19:10
  • @KonradKokosa could you give an explanation and an example in an answer? To be honest, I'm not that experienced with file handling, so I'm not sure what the difference is between File.ReadLines and the way I did it – CodeMonkey Jan 13 '14 at 19:14
  • BTW: according to your code, a line may be split across two chunks. – L.B Jan 13 '14 at 19:14
  • @L.B what do you mean? – CodeMonkey Jan 13 '14 at 19:15
  • @YonatanNir Take a look at the documentation, use Google to find examples, etc. Come here if, after doing some basic research, you're still struggling to use it. I *highly* doubt that would happen though. Its use is *far* easier than the alternatives, such as what you're trying to do. – Servy Jan 13 '14 at 19:15
  • @YonatanNir a line's first half may be in one chunk and the other half in the next chunk. – L.B Jan 13 '14 at 19:16
  • Where do you think is the bottleneck? Reading from the file? or creating a new Stream from the buffer? or is it expected given that the files are so huge? One option could be to use multithreaded read where one thread reads while other one performs the rest of the stuff and vice versa. – Abhinav Jan 13 '14 at 19:16
  • Underneath, `File.ReadLines` is calling `stream.ReadLine` sequentially, but paradoxically it may still be faster than the overhead of your buffering. – Konrad Kokosa Jan 13 '14 at 19:17
  • BufferedStream is redundant with FileStream. – Adam Mills Jan 13 '14 at 19:17
  • @Servy the question is this: is File.ReadAllLines() just easier to write, or is it also more efficient? – CodeMonkey Jan 13 '14 at 19:18
  • Why do you need to create 32MB chunks? Why not use StreamReader directly over the file? – Wagner DosAnjos Jan 13 '14 at 19:18
  • @YonatanNir Who said anything at all about `ReadAllLines`? The comment was to use `File.ReadLines`, which is an entirely different method. It is going to stream the data rather than eagerly loading it into memory, it has a convenient syntax, and it is almost certainly sufficiently performant. – Servy Jan 13 '14 at 19:20
  • @wdosanjos that's a limitation I got... I can't assume I can fit the entire file in memory – CodeMonkey Jan 13 '14 at 19:20
  • There's no need to jump through all those hoops. There is a [StreamReader constructor](http://msdn.microsoft.com/en-us/library/2a2z3f9a.aspx) that lets you specify the buffer size (a short sketch of this follows the comment thread). Also, consider defining your constant as `const int MAX_BUFFER = 32 * 1024 * 1024;` That's a whole lot clearer than the magic number. By the way, I've found that the optimum buffer size is usually around 64 kilobytes. A bigger buffer just adds needless overhead, and usually makes your program slower. – Jim Mischel Jan 13 '14 at 19:22
  • @YonatanNir But you don't need to specifically use a 32MB sized buffer; you can use whatever makes the most sense. If you just read one line at a time it will be automatically buffered to a sensible buffer size; there is no need to buffer a second time. – Servy Jan 13 '14 at 19:22
  • Could a line be larger than 32MB? – Andrew Morton Jan 13 '14 at 19:22
  • possible duplicate of [What's the fastest way to read a text file line-by-line?](http://stackoverflow.com/questions/8037070/whats-the-fastest-way-to-read-a-text-file-line-by-line) – Richard Jan 13 '14 at 19:23
  • @Servy I want to add a restriction that I can't read the entire file line by line – CodeMonkey Jan 13 '14 at 20:53
  • @YonatanNir And why is that? What's wrong with doing that? – Servy Jan 13 '14 at 20:55
  • @Servy since this is the restriction I got: "You cannot read the entire file, one line at a time. Maximum size of data read, except for the actual records that need to be returned for the query, cannot exceed 32MB." And I'm just not sure how to do it efficiently. – CodeMonkey Jan 13 '14 at 20:55
  • @YonatanNir And how is your explicitly buffering the data, instead of using the existing .NET class that will do it for you, any better? Especially given that your implementation isn't even working? If you want a working implementation, use the framework's method instead of re-writing it from scratch. If there is a problem with it, then *explain what the problem is*; otherwise someone else will re-write the same reader with the exact same problem. – Servy Jan 13 '14 at 20:57
  • @Servy what I wrote DOES work but it's slow. Given these restrictions, I wanted to know what can be done in order to make it work faster. – CodeMonkey Jan 13 '14 at 21:07
  • @YonatanNir What you wrote *won't* work. Every 32 MB it will add in a line break that wasn't there before. If you want to improve its performance then say *that*, rather than adding arbitrary restrictions that are only likely to *harm* performance. And if performance is your only concern here, then try the method proposed to you and see if it is sufficient for your purposes before rejecting it. – Servy Jan 13 '14 at 21:14
  • @Servy the title of this question is: "Read efficiently chunks of a very large file". I already wrote there that I'm searching for efficient ways of reading a file IN CHUNKS. Everyone was more comfortable with the simple line-by-line solution. I'm not saying that this way of reading the file is bad; it's probably even better, but I'm not the one that made the restrictions. I'm just passing them on here as they were given to me. – CodeMonkey Jan 13 '14 at 21:19
  • @YonatanNir If you have some academic assignment of re-writing `BufferedStream` then go ahead and re-write it. It's right there to use as an example. It rather defeats the purpose if you just ask us to do it all for you. The .NET library has already done it for you. – Servy Jan 13 '14 at 21:21
  • @AndreiRinea, the `interview-questions` tag was [*intentionally* destroyed](http://meta.stackexchange.com/q/142869/135887). Please do not add it to any more questions. – Charles Feb 24 '14 at 17:54
  • You actually have two issues with line breaks. One, the way you're going about this will introduce artificial line breaks, as others have pointed out. Also, CSV can have line breaks within a field, which you'll have to account for. You'll probably want to read this byte by byte. – Chris Anderson Feb 24 '14 at 18:10
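
As a concrete illustration of Jim Mischel's StreamReader-constructor suggestion above, here is a minimal sketch; the 64KB buffer size comes from his comment, while the encoding choice and loop body are assumptions added for the example:

using System.IO;
using System.Text;

const int BufferSize = 64 * 1024; // ~64KB, per the comment above; bigger buffers tend to add overhead

// StreamReader(path, encoding, detectEncodingFromByteOrderMarks, bufferSize)
using (var reader = new StreamReader(filePath, Encoding.UTF8, true, BufferSize))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process line
    }
}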

1 Answer


I would suggest simply using File.ReadLines over the file. It calls StreamReader.ReadLine underneath, but it may well be more efficient than shuttling 32MB chunks through a BufferedStream over and over. So it would be as simple as:

foreach (var line in File.ReadLines(filePath))
{
    //process line 
}
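
Since File.ReadLines returns a lazily evaluated IEnumerable<string> (it streams the file rather than loading it into memory), it also composes with LINQ without holding more than the reader's small buffer at a time. A tiny illustrative sketch; the non-empty-line filter is made up for the example:

using System.IO;
using System.Linq;

// counts non-empty lines while streaming the file
int nonEmpty = File.ReadLines(filePath).Count(l => l.Length > 0);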

Moreover, your code has a bug: a line can be split between two 32MB chunks, which will not happen with the code above. (If the chunked reads really are required, one workaround is sketched below.)
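
If the 32MB-per-read restriction really is binding, one common workaround is to carry the incomplete tail of each chunk over to the next read. This is a minimal sketch, not the answer's method: it assumes UTF-8 text with '\n' line endings and no newlines inside quoted CSV fields (see Chris Anderson's caveat in the comments), and ProcessLine and Concat are hypothetical names introduced for the example:

using System;
using System.IO;
using System.Text;

static class ChunkedReader
{
    // Hypothetical placeholder for whatever per-line work is needed.
    static void ProcessLine(string line) { /* ... */ }

    public static void Read(string filePath)
    {
        const int MAX_BUFFER = 32 * 1024 * 1024; // at most 32MB per read
        byte[] buffer = new byte[MAX_BUFFER];
        byte[] leftover = new byte[0];           // incomplete line carried between chunks
        int bytesRead;

        using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
        {
            while ((bytesRead = fs.Read(buffer, 0, MAX_BUFFER)) != 0)
            {
                // Everything after the last '\n' in this chunk is an incomplete line.
                int lastNewline = Array.LastIndexOf(buffer, (byte)'\n', bytesRead - 1);
                if (lastNewline < 0)
                {
                    // No line ending anywhere in the chunk: keep accumulating.
                    leftover = Concat(leftover, buffer, 0, bytesRead);
                    continue;
                }

                // Stitch the carried-over tail onto the complete part of this chunk.
                var ms = new MemoryStream();
                ms.Write(leftover, 0, leftover.Length);
                ms.Write(buffer, 0, lastNewline + 1);
                ms.Position = 0;
                using (var reader = new StreamReader(ms)) // disposes ms as well
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                        ProcessLine(line);
                }

                // Save the bytes after the last '\n' for the next iteration.
                leftover = Concat(new byte[0], buffer, lastNewline + 1, bytesRead - lastNewline - 1);
            }
        }

        // A final line with no trailing '\n' ends up here.
        if (leftover.Length > 0)
            ProcessLine(Encoding.UTF8.GetString(leftover));
    }

    // Returns head followed by count bytes of tail starting at offset.
    static byte[] Concat(byte[] head, byte[] tail, int offset, int count)
    {
        var result = new byte[head.Length + count];
        Array.Copy(head, 0, result, 0, head.Length);
        Array.Copy(tail, offset, result, head.Length, count);
        return result;
    }
}

The stitching buffer (leftover plus the complete part of a chunk) is bounded by roughly one chunk plus one line, so each read call still never pulls more than 32MB from the file as long as individual lines stay small.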

Konrad Kokosa
  • Is there a way to make sure the internal buffer being used will not exceed 32MB? (Even though it's safe to assume a single line won't be that large, but still...) – CodeMonkey Jan 13 '14 at 19:24
  • @YonatanNir, going through the code of `File.ReadLines`, we can see that the `public StreamReader(string path, Encoding encoding) : this(path, encoding, true, StreamReader.DefaultBufferSize)` constructor is used, and `DefaultBufferSize` is simply 1024. – Konrad Kokosa Jan 13 '14 at 19:30
  • What if I also got a restriction that I can't read the file one line at a time? – CodeMonkey Jan 13 '14 at 20:44
  • @YonatanNir Are you trying to make this as difficult as possible for yourself? Please list all the other restrictions in your question. – Andrew Morton Jan 13 '14 at 21:06
  • @AndrewMorton This is a restriction I got. The title of my question is about reading chunks, but everyone was more comfortable with the simple line-by-line solution. The restriction itself is: "You cannot read the entire file, one line at a time. Maximum size of data read, except for the actual records that need to be returned for the query, cannot exceed 32MB." – CodeMonkey Jan 13 '14 at 21:09
  • SOOOOO SLOOOOW, no way I would be able to use this – Brendon Vdm Sep 02 '15 at 09:35