I have a large CSV file (7 GB), and some of its fields contain line breaks inside the text. I am able to split the file in C# using the code below, but because a field contains line breaks, each one starts a new line in the output. I cannot replace the line breaks using ReadLine(), as it throws an out-of-memory exception because the file is huge.

using (StreamReader reader = new StreamReader(inputFilePath))
{
    int fileCount = 0;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (fileCount % batchSize == 0)
        {
            string outputFilePath = Path.Combine(outputDirectory, $"output_{fileCount / batchSize}.csv");
            using (StreamWriter writer = new StreamWriter(outputFilePath))
            {
                writer.WriteLine(line);
            }
        }
        else
        {
            string outputFilePath = Path.Combine(outputDirectory, $"output_{fileCount / batchSize}.csv");
            using (StreamWriter writer = new StreamWriter(outputFilePath, true))
            {
                writer.WriteLine(line);
            }
        }
        fileCount++;
    }
}

The code above reads the large file and splits it successfully. The only problem is that it also treats line breaks inside a column as row breaks and splits the field onto another line. As mentioned above, I cannot replace the line breaks using ReadLine().Replace, as it throws an out-of-memory exception.

Please advise how to perform both operations at the same time.

If not C#, PowerShell would also work.

I tried the following in PowerShell, but it had the same issue:

Import-Csv -Path "C:\largefile.csv" | Group-Object -Property { [math]::Floor($_.PSObject.Properties.Count / 10000) } | ForEach-Object { $_.Group | Export-Csv -Path "C:\smallfile$($_.Name).csv" -NoTypeInformation }
  • You'll need to process a character at a time and track when you are reading inside a quoted field. Unfortunately, even though csv is an old format, too many parsers are broken. – Jeremy Lakeman Apr 06 '23 at 01:46
  • Also, performance is going to be terrible if you close and open your output files for every line. – Jeremy Lakeman Apr 06 '23 at 01:51
  • You only have to count the quote characters in the line. If the count is even, you have a full record; if it is odd, you have to read the next line too and repeat the count. – Sir Rufo Apr 06 '23 at 02:03
  • Why do you keep opening and closing/disposing the _StreamWriter_? – Tu deschizi eu inchid Apr 06 '23 at 02:40
  • The following may be of interest: https://www.c-sharpcorner.com/forums/c-sharp-how-to-read-a-large-1-gb-txt-file-in-net and https://stackoverflow.com/questions/4273699/how-to-read-a-large-1-gb-txt-file-in-net – Tu deschizi eu inchid Apr 06 '23 at 02:51
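
The quote-counting idea from the comments, combined with keeping one writer open per output file, could be sketched roughly as below. This is only a sketch under some assumptions: the file uses standard double-quote quoting (a literal quote inside a field is written as `""`), and quotes never appear in unquoted fields. The names `CsvSplitter`, `ReadRecords`, `Split`, and the `embeddedBreak` parameter are made up for illustration. Only one record is held in memory at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CsvSplitter
{
    // Yields one logical CSV record at a time. A physical line that leaves the
    // running quote count odd ends inside a quoted field, so the following
    // lines are appended until the count is even again. An escaped quote ("")
    // toggles the flag twice, so it does not change the result.
    // embeddedBreak is what replaces each line break found inside a field:
    // "\n" keeps the break, " " flattens it (hypothetical parameter).
    public static IEnumerable<string> ReadRecords(TextReader reader, string embeddedBreak = "\n")
    {
        var record = new StringBuilder();
        bool insideQuotes = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Still inside a quoted field: this line continues the record.
            if (insideQuotes) record.Append(embeddedBreak);
            record.Append(line);

            foreach (char c in line)
            {
                if (c == '"') insideQuotes = !insideQuotes;
            }

            if (!insideQuotes)
            {
                yield return record.ToString();
                record.Clear();
            }
        }
        // Input ended inside an unterminated quoted field: emit what is left.
        if (record.Length > 0) yield return record.ToString();
    }

    // Writes batchSize records per output file, opening each file only once.
    public static void Split(string inputFilePath, string outputDirectory, int batchSize)
    {
        StreamWriter writer = null;
        try
        {
            using (var reader = new StreamReader(inputFilePath))
            {
                int recordCount = 0;
                foreach (string record in ReadRecords(reader, " "))
                {
                    if (recordCount % batchSize == 0)
                    {
                        writer?.Dispose();
                        string outputFilePath = Path.Combine(
                            outputDirectory, $"output_{recordCount / batchSize}.csv");
                        writer = new StreamWriter(outputFilePath);
                    }
                    writer.WriteLine(record);
                    recordCount++;
                }
            }
        }
        finally
        {
            writer?.Dispose();
        }
    }
}
```

Here `Split` passes `" "` so that embedded line breaks are replaced with a space while splitting. Like the code in the question, it does not repeat the header row in each output file.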
