5

I have 4GB+ text files (CSV format) and I want to process these files using LINQ in C#.

I load the CSV, convert the rows to a class, and then run a complex LINQ query.

But although the file is 4 GB, the application uses roughly double the file size in memory.

How can I process (run LINQ over and produce a new result from) such large files?

Thanks

Alex Aza
oguzh4n

3 Answers

12

Instead of loading the whole file into memory, you could read and process the file line by line.

using (var streamReader = new StreamReader(fileName))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // analyze the line here
        // throw it away if it does not match
    }
}
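
A closely related option, assuming .NET 4.0 or later, is File.ReadLines, which returns a lazy IEnumerable<string>, so a LINQ query over it still reads one line at a time; the column positions and the filter below are made up for illustration:

// Requires System.IO and System.Linq; only the matching rows are materialized.
var bars = File.ReadLines(fileName)
    .Select(line => line.Split(','))
    .Where(fields => fields[0] == "Baz")          // hypothetical predicate
    .Select(fields => int.Parse(fields[1]))
    .ToList();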

[EDIT]

If you need to run complex queries against the data in the file, the right thing to do is to load the data into a database and let the DBMS take care of data retrieval and memory management.
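
For illustration, a minimal sketch of that approach using SQLite (via the Microsoft.Data.Sqlite package) might look like the following; the table layout, column names, and file name are assumptions, and any other DBMS with a bulk-load facility would work just as well:

// Rough sketch, assuming the Microsoft.Data.Sqlite NuGet package and a two-column CSV.
using System;
using System.IO;
using Microsoft.Data.Sqlite;

static class CsvToSqlite
{
    static void Main()
    {
        using (var connection = new SqliteConnection("Data Source=records.db"))
        {
            connection.Open();

            using (var create = connection.CreateCommand())
            {
                create.CommandText =
                    "CREATE TABLE IF NOT EXISTS Records (Foo TEXT, Bar INTEGER)";
                create.ExecuteNonQuery();
            }

            // Stream the CSV line by line inside a single transaction:
            // memory usage stays flat no matter how big the file is.
            using (var transaction = connection.BeginTransaction())
            using (var insert = connection.CreateCommand())
            using (var reader = new StreamReader("myFile.csv"))
            {
                insert.Transaction = transaction;
                insert.CommandText =
                    "INSERT INTO Records (Foo, Bar) VALUES ($foo, $bar)";
                var foo = insert.Parameters.Add("$foo", SqliteType.Text);
                var bar = insert.Parameters.Add("$bar", SqliteType.Integer);

                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    var fields = line.Split(',');
                    foo.Value = fields[0];
                    bar.Value = long.Parse(fields[1]);
                    insert.ExecuteNonQuery();
                }
                transaction.Commit();
            }

            // The complex queries (joins, grouping, filtering) then run in the
            // database instead of in application memory.
            using (var query = connection.CreateCommand())
            {
                query.CommandText = "SELECT Foo, COUNT(*) FROM Records GROUP BY Foo";
                using (var result = query.ExecuteReader())
                {
                    while (result.Read())
                        Console.WriteLine("{0}: {1}", result.GetString(0), result.GetInt64(1));
                }
            }
        }
    }
}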

Alex Aza
  • What if all the text is on a single line, without a carriage return? – Rosmarine Popcorn Jun 24 '11 at 07:30
  • @Cody - I assumed that the CSV file is not a single-line file. – Alex Aza Jun 24 '11 at 07:33
  • Then you would process the whole line as a stream of bytes, rather than a stream of lines. – Roy Dictus Jun 24 '11 at 07:33
  • @Cody then it's either one record that would potentially have to be handled at once, since you can't know up front what fields are going to be used in the LINQ query, or an unusual record delimiter has been used, and the OP would very likely have included that since it's a crucial detail. – Rune FS Jun 24 '11 at 07:37
  • Should I load all the data into memory to run the LINQ query and create a new result? – oguzh4n Jun 24 '11 at 07:49
  • @oguzh4n - if you want to use LINQ, yes, you need to load it into memory. Another option is to create a kind of LINQ-to-file wrapper that fetches line by line, turns each line into an object, runs the predicates on that object, and selects or throws away the line. Either way, reading line by line is the only way to save memory. – Alex Aza Jun 24 '11 at 07:54
  • I run a complex LINQ query (a self-join and one more predicate). – oguzh4n Jun 24 '11 at 08:03
  • @oguzh4n - if you run complicated self-join queries, you have to load the data into memory. – Alex Aza Jun 25 '11 at 22:57
1

I think this one is a good way... CSV

Gans
1

If you are using .NET 4.0 you could use Clay and then write a method that returns an IEnumerable, yielding one record per line, which makes code like the below possible:

from record in GetRecords("myFile.csv", new[] { "Foo", "Bar" }, new[] { "," })
where record.Foo == "Baz"
select new { MyRealBar = int.Parse(record.Bar) }

The method to project the CSV into a sequence of Clay objects could be created like this:

private IEnumerable<dynamic> GetRecords(
    string filePath,
    IEnumerable<string> columnNames,
    string[] delimiter)
{
    if (!File.Exists(filePath))
        yield break;
    var columns = columnNames.ToArray();
    dynamic New = new ClayFactory();
    using (var streamReader = new StreamReader(filePath))
    {
        var columnLength = columns.Length;
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            // Build one Clay object per line; only the current line is held in memory.
            var record = New.Record();
            var fields = line.Split(delimiter, StringSplitOptions.None);
            if (fields.Length != columnLength)
                throw new InvalidOperationException(
                    "fields count does not match column count");
            for (int i = 0; i < columnLength; i++)
            {
                record[columns[i]] = fields[i];
            }
            yield return record;
        }
    }
}
Rune FS
  • Thanks for the advice. I tried this solution, but it's very slow and has the same memory issue. – oguzh4n Jun 24 '11 at 08:55
  • @oguzh4n Oh, I deliberately did not take speed into account, since you didn't mention that in your post; I'd prefer readability (in this case, of the call site) over speed any day. As for the memory issues: if you could be more precise about those, they can be fixed. This does not have to hold more than one line of the text file and one Clay object at a time (and a bit), so whatever memory issues the draft above has can be fixed. – Rune FS Jun 24 '11 at 08:58
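
For what it's worth, the speed concern could be sketched around by dropping the dynamic Clay objects and yielding a plain strongly-typed record instead, while keeping the same one-line-at-a-time streaming; the Record type and column positions below are assumptions about the file layout:

// Hypothetical record type - adjust the properties to the actual CSV columns.
public class Record
{
    public string Foo { get; set; }
    public string Bar { get; set; }
}

private static IEnumerable<Record> GetTypedRecords(string filePath)
{
    if (!File.Exists(filePath))
        yield break;
    using (var streamReader = new StreamReader(filePath))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var fields = line.Split(',');
            // Only the current record is alive at any point; a LINQ query over
            // this sequence stays streaming as long as it does not buffer
            // (self-joins, OrderBy and the like still need everything in memory).
            yield return new Record { Foo = fields[0], Bar = fields[1] };
        }
    }
}

A call such as GetTypedRecords("myFile.csv").Where(r => r.Foo == "Baz") then avoids the dynamic dispatch overhead entirely.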