
I know this question has been asked before, but I can't seem to get it working with the answers I've read. I've got a CSV file of roughly 1.2 GB. If I run the process as 32-bit I get an OutOfMemoryException; it works if I run it as a 64-bit process, but it still takes 3.4 GB of memory. I know I'm storing a lot of data in my CustomData class, but still, 3.4 GB of RAM? Am I doing something wrong when reading the file? dict is a dictionary that simply maps a column index to the property the value should be stored in. Am I doing the reading the right way?

StreamReader reader = new StreamReader(File.OpenRead(path));
while (!reader.EndOfStream)
{
    String line = reader.ReadLine();
    String[] values = line.Split(';');
    CustomData data = new CustomData();
    string value;
    for (int i = 0; i < values.Length; i++)
    {
        dict.TryGetValue(i, out value);          // value = property name mapped to column i
        Type targetType = data.GetType();
        PropertyInfo prop = targetType.GetProperty(value);
        if (values[i] == null)
        {
            prop.SetValue(data, "NULL", null);
        }
        else
        {
            prop.SetValue(data, values[i], null);
        }
    }
    dataList.Add(data);
}
najk
  • First of all, you don't get to use the whole memory for a C# process. I would recommend you use the LumenWorks CSV parser, and don't use reflection. Why do you need reflection? It's a CSV file; you are torturing yourself. – DarthVader Jul 13 '12 at 07:39
  • See this for another explanation of why you don't have all the memory to yourself: http://stackoverflow.com/questions/1109558/allocating-more-than-1-000-mb-of-memory-in-32-bit-net-process – DarthVader Jul 13 '12 at 07:40
  • Do you really have to keep the whole parsed data in memory? Maybe you should consider storing it somewhere else (a database? a file with binary serialization?...). Could you give us an insight into your CustomData class definition? – Julien Ch. Jul 13 '12 at 07:41
  • Agreed. You can aggregate the data list. – DarthVader Jul 13 '12 at 07:42
  • Thanks, but I'm aware of this. I'm just wondering if I'm doing the reading wrong; a lot of people on Stack Overflow recommend using StreamReader, so I'm just thinking that I might be using it the wrong way. – najk Jul 13 '12 at 07:42
  • A lot of people would recommend you use a CSV parser. You are trying to reinvent the wheel, but apparently that's orthogonal. – DarthVader Jul 13 '12 at 07:43
  • The data is to be sent over the net, so I'm thinking of maybe reading some, sending some, reading some, sending some. Don't know if I should thread it or not. – najk Jul 13 '12 at 07:43
  • Ah. You can do async right there: while it's sending, you can read in more data, and when it's done sending, send the new data that you just read in. It would be more complicated to code. – DarthVader Jul 13 '12 at 07:44
  • Sounds fair enough, I will give it a try! That means I won't have to store that much in memory, because I will empty my CustomData list every time I send. – najk Jul 13 '12 at 07:46
  • Well, nothing to do with memory usage, but you can use File.OpenText, which returns a StreamReader directly. Why don't you try to use an external library like [FileHelpers](http://www.filehelpers.com/), built specifically for this kind of work? – Steve Jul 13 '12 at 07:47
  • You may want to read into a fixed-size buffer (or array, if you will) of CustomData so that you don't queue up a lot of entries taking up all your memory in case you have network lag or connectivity issues. Then you would hold off reading until you have room in your buffer. – Simon Ejsing Jul 13 '12 at 08:05
  • Not sure if this is important, but (values[i] == null) can never evaluate to true. – user1096188 Jul 13 '12 at 08:37

2 Answers


There doesn't seem to be anything wrong with your usage of the stream reader: you read a line into memory, then forget it.

However, in C# a string is encoded in memory as UTF-16, so on average a character consumes 2 bytes in memory.

If your CSV also contains a lot of empty fields that you convert to "NULL", you add roughly 8 bytes of character data (four UTF-16 characters) for each empty field.

So on the whole, since you basically store all the data from your file in memory, it's not really surprising that you require almost 3 times the size of the file in memory.

The actual solution is to parse your data in chunks of N lines, process them, and free them from memory.
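
A minimal sketch of what that could look like, reusing the loop from the question (ParseLine and ProcessChunk are hypothetical placeholders for the per-line parsing and for whatever is done with each chunk, e.g. sending it over the network as discussed in the comments above):

const int ChunkSize = 10000;                         // assumption: tune to taste
List<CustomData> chunk = new List<CustomData>(ChunkSize);

using (StreamReader reader = new StreamReader(File.OpenRead(path)))
{
    while (!reader.EndOfStream)
    {
        chunk.Add(ParseLine(reader.ReadLine()));     // same per-line parsing as in the question

        if (chunk.Count == ChunkSize)
        {
            ProcessChunk(chunk);                     // send it, write it out, aggregate it, ...
            chunk.Clear();                           // drop the references so the rows can be collected
        }
    }

    if (chunk.Count > 0)
        ProcessChunk(chunk);                         // handle the final partial chunk
}

This way only ChunkSize parsed rows are alive at any time instead of the whole file.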

Note: consider using a CSV parser; there is more to CSV than just commas or semicolons. What if one of your fields contains a semicolon, a newline, a quote...?
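
As one example of such a parser (an editor's addition; the original thread suggests LumenWorks and FileHelpers), the TextFieldParser class that ships with .NET in Microsoft.VisualBasic.FileIO handles quoted fields and embedded delimiters. A rough sketch:

using Microsoft.VisualBasic.FileIO;                  // reference the Microsoft.VisualBasic assembly

using (TextFieldParser parser = new TextFieldParser(path))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(";");
    parser.HasFieldsEnclosedInQuotes = true;

    while (!parser.EndOfData)
    {
        string[] values = parser.ReadFields();       // one correctly parsed record per call
        // map 'values' onto CustomData as in the question's loop
    }
}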

Edit

Actually, each string takes up to 20 + (N/2)*4 bytes in memory; see C# in Depth.
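
For example, under that formula a 16-character field stored as a string takes roughly 20 + (16/2)*4 = 52 bytes in memory, i.e. more than three times its size in a single-byte-encoded file, which lines up with the almost 3x figure above.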

Julien Ch.

OK, a couple of points here.

  • As pointed out in the comments, .NET under x86 can only consume about 1.5 GB per process, so consider that your maximum memory in 32-bit.

  • The StreamReader itself will have an overhead. I don't know whether it caches the entire file in memory or not (maybe someone can clarify?). If so, reading and processing the file in chunks might be a better solution.

  • The CustomData class: how many fields does it have, and how many instances are created? Note that you need 32 bits for each reference in x86 and 64 bits for each reference in x64. So if CustomData has 10 fields of type System.Object, each CustomData instance requires 88 bytes before storing any data.

  • The dataList.Add at the end. I assume you are adding to a generic List<T>? If so, note that List<T> employs a doubling algorithm to resize. If you have 1 GB in a List and it needs to grow by one more element, it will create a 2 GB backing array and copy the 1 GB into it, so all of a sudden that 1 GB plus one element actually requires 3 GB to manipulate. An alternative is to use a pre-sized array, or a List created with an explicit capacity (see the sketch after this list).
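
To illustrate that last point, here is a minimal sketch of pre-sizing the list (the extra line-counting pass is an assumption; an estimate of the row count works just as well):

int lineCount = File.ReadLines(path).Count();                 // one extra sequential pass over the file (requires System.Linq)
List<CustomData> dataList = new List<CustomData>(lineCount);  // backing array allocated once, no doubling copies

The extra pass costs one more read of the file, but the backing array is then never resized or copied while you add rows.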

Dr. Andrew Burnett-Thompson
  • Thanks, mate! Now that you mention it, I've heard about the doubling behaviour of generic lists before. – najk Jul 19 '12 at 07:57
  • Yes, it's easy to see: download ILSpy and open up the code for List<T> and take a look. When a new point is appended, it checks the size of the backing array, and if it's not big enough, it creates a new array of size 2*size and performs a copy. So at one instant in time you have 3*size elements in memory just to add a single point! – Dr. Andrew Burnett-Thompson Jul 19 '12 at 08:47