I have one large DataTable with a few million records. I need to export it into multiple CSV files of a specific size. For example, if I choose a file size of 5MB, the DataTable should be exported as, say, 4 CSV files of 5MB each, with the last file's size varying depending on the remaining records. I went through many solutions here and also had a look at the CsvHelper library, but they all deal with splitting one large file into multiple CSVs, not with exporting an in-memory DataTable to multiple CSV files based on a specified file size. I want to do this in C#. Any help in this direction would be great.

Thanks, Jay

Jay Nanavaty
  • Number of files = total size / 5MB. You need to estimate the total size somehow, then work out how many files you need, and then just use normal methods to split them out. Do you have code that works for splitting files based on the _count_ of files? If so, post it and we can help you adapt it for size. – Nick.Mc Jan 15 '17 at 06:43
  • Thanks Nick. I need to convert an in-memory DataTable into multiple CSVs, so the total size has to be determined from the DataTable first. It is an in-memory DataTable I want to split, not a large file. – Jay Nanavaty Jan 15 '17 at 06:45
  • What is the code you've tried so far? – Sefe Jan 15 '17 at 06:47
  • Hi @sefe So far I only have code to query a local SQL DB table and get back a DataTable with a huge number of records, which I then need to export to multiple CSV files. So the current code just queries the database and gets a DataTable object as the result. Standard ADO.NET code. :) – Jay Nanavaty Jan 15 '17 at 06:49
  • @nick yes. I am using this code and finding it efficient: http://stackoverflow.com/questions/28503437/most-efficient-way-of-coverting-a-datatable-to-csv – Jay Nanavaty Jan 15 '17 at 06:51
  • Great. Is your data _always_ in the same format? Because then you can make an estimate of the size per row, and then you know how many rows to export per file. Then you just adjust that code. – Nick.Mc Jan 15 '17 at 06:52
  • Thanks Nick. Estimating the size is exactly what I want to do here. The problem is that the DataTable columns will differ each time, so I can't go with any fixed row size. – Jay Nanavaty Jan 15 '17 at 06:54
  • Let me rephrase: what have you tried so far _to solve your specific problem_? – Sefe Jan 15 '17 at 07:03
  • Then the challenge is to find out how much space the file will take upon export. I googled it and found one method for finding the size of an object in memory, but I believe the DataTable has a lot of overhead. Therefore I think the most practical solution is to come up with a rough size-per-row figure, export the number of files you think is right, measure the size of the files, and repeat (a rough sketch of this approach follows these comments). – Nick.Mc Jan 15 '17 at 07:08
  • Why hold the CSV in memory? Write it line by line and check the size with `FileStream.Length`. When it exceeds 5MB, create the next file. – H.G. Sandhagen Jan 15 '17 at 07:13
  • @H.G.Sandhagen Makes sense. I will give it a try now. – Jay Nanavaty Jan 15 '17 at 07:22
  • Is the data in a database? SQL Server has a command-line executable, SQLCMD.EXE, that will output query results as CSV and runs much faster than doing it in C#. – jdweng Jan 15 '17 at 07:25
  • @jdweng Thanks for the reply. I know that SQL Server has that executable, but eventually my solution is going to work with any database, such as MySQL, Oracle, SQL Server, etc. So it has to be a platform-agnostic solution. – Jay Nanavaty Jan 15 '17 at 07:28
  • @H.G.Sandhagen Though I have started to explore the direction you suggested, do you have any useful links where I can get some help? It seems that what you suggested is going to be the solution to my problem. – Jay Nanavaty Jan 15 '17 at 07:31
  • I usually use the following: `foreach (DataRow row in dt.AsEnumerable()) { writer.WriteLine(string.Join(",", row.ItemArray.Select(x => x.ToString()))); }` – jdweng Jan 15 '17 at 07:43
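
For illustration, here is a minimal sketch of the size-estimation approach Nick.Mc describes above: render a small sample of rows the same way the export would, derive an average byte count per CSV row, and compute how many rows fit into the target file size. The `EstimateRowsPerFile` helper and its sample size are my own invention, and it assumes `System.Data`, `System.Linq`, and `System.Text` are in scope.

private static int EstimateRowsPerFile(DataTable table, long targetBytes, int sampleSize = 100)
{
    // Render a small sample of rows exactly as the export would.
    int sampled = Math.Min(sampleSize, table.Rows.Count);
    long sampledBytes = 0;
    for (int i = 0; i < sampled; i++)
    {
        string line = string.Join(",", table.Rows[i].ItemArray.Select(x => x.ToString()));
        sampledBytes += Encoding.UTF8.GetByteCount(line) + Environment.NewLine.Length;
    }
    if (sampled == 0 || sampledBytes == 0)
        return int.MaxValue; // nothing to estimate; everything fits in one file

    // Average bytes per row gives rows per file (at least one row per file).
    long avgBytesPerRow = sampledBytes / sampled;
    return (int)Math.Max(1, targetBytes / avgBytesPerRow);
}

As the comments note, this is only an estimate; actual file sizes will drift when column contents vary, which is why the accepted answer below checks the real stream length instead.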

2 Answers


Thanks @H.G.Sandhagen and @jdweng for the inputs. Currently I have written the following code, which does what is needed. I know it is not perfect; some enhancements can surely be made, and it can be made more efficient if we can pre-determine the length from the DataTable item array, as pointed out by Nick.McDermaid. As of now, I will go with this code to unblock myself and will post the final optimized version when I have it coded.

public void WriteToCsv(DataTable table, string path, int size)
{
    // path is expected to contain a placeholder for the file number, e.g. "export_{0}.csv"
    int fileNumber = 0;
    StreamWriter sw = new StreamWriter(string.Format(path, fileNumber), false);

    // Header row (written to the first file only; see the comments below)
    for (int i = 0; i < table.Columns.Count; i++)
    {
        sw.Write(table.Columns[i].ColumnName);
        if (i < table.Columns.Count - 1)
        {
            sw.Write(",");
        }
    }
    sw.WriteLine();

    foreach (DataRow row in table.AsEnumerable())
    {
        sw.WriteLine(string.Join(",", row.ItemArray.Select(x => x.ToString())));
        if (sw.BaseStream.Length > size) // Time to create a new file!
        {
            sw.Close(); // Close also disposes the writer
            fileNumber++;
            sw = new StreamWriter(string.Format(path, fileNumber), false);
        }
    }

    sw.Close();
}
Jay Nanavaty
  • I created a similar answer, but you were faster. Btw., three things to mention: 1. The string delimiter (") normally used in CSV is missing; depending on your data you may have to improve this. 2. Your CSV files (except the first one) are missing the header line. 3. To create a header line you could use `var header = String.Join(",", datatable.Columns.Cast<DataColumn>().Select(p => $"\"{p.ColumnName}\""));` (a sketch of these fixes follows this comment thread). – H.G. Sandhagen Jan 15 '17 at 08:17
  • Yes @H.G.Sandhagen, thanks for listing the improvements. Actually I know that, and that's why I noted in the answer that this code is yet to be optimized. In my case, I will not need the header in any of the CSV files at all; I am storing it in a separate file where I also keep the schema of the server table. – Jay Nanavaty Jan 15 '17 at 08:28
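
For reference, here is a minimal sketch of the two improvements discussed in the comments above: quoting every field and repeating the header line in each file. The `Escape` helper is my own illustration and only covers basic quote doubling; embedded commas and quotes are handled, but nulls and culture-specific formatting are not.

private static string Escape(object value)
{
    // Double any embedded quotes and wrap the whole field in quotes.
    return "\"" + value.ToString().Replace("\"", "\"\"") + "\"";
}

private static void WriteHeader(StreamWriter sw, DataTable table)
{
    sw.WriteLine(string.Join(",", table.Columns.Cast<DataColumn>()
                                       .Select(c => Escape(c.ColumnName))));
}

public static void WriteToCsvWithHeaders(DataTable table, string path, long maxBytes)
{
    int fileNumber = 0;
    var sw = new StreamWriter(string.Format(path, fileNumber), false);
    WriteHeader(sw, table);

    foreach (DataRow row in table.AsEnumerable())
    {
        sw.WriteLine(string.Join(",", row.ItemArray.Select(Escape)));
        sw.Flush(); // make BaseStream.Length reflect what was actually written

        if (sw.BaseStream.Length > maxBytes)
        {
            sw.Close();
            fileNumber++;
            sw = new StreamWriter(string.Format(path, fileNumber), false);
            WriteHeader(sw, table); // this time the header goes into every file
        }
    }

    sw.Close();
}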

I had a similar problem and this is how I solved it with CsvHelper.

The answer could easily be adapted to use a DataTable as the source; a rough sketch of that adaptation follows the code below.

public void SplitCsvTest()
{
  var inventoryRecords = new List<InventoryCsvItem>();

  for (int i = 0; i < 100000; i++)
  {
    inventoryRecords.Add(new InventoryCsvItem { ListPrice = i + 1, Quantity = i + 1 });
  }

  const long MAX_BYTES = 5 * 1024 * 1024; // 5 MB

  List<byte[]> parts = new List<byte[]>();

  using (var memoryStream = new MemoryStream())
  {
    using (var streamWriter = new StreamWriter(memoryStream))
    using (var csvWriter = new CsvWriter(streamWriter))
    {
      csvWriter.WriteHeader<InventoryCsvItem>();
      csvWriter.NextRecord();

      // Flush so the stream length reflects the header size.
      csvWriter.Flush();
      streamWriter.Flush();

      var headerSize = memoryStream.Length;
      foreach (var record in inventoryRecords)
      {
        csvWriter.WriteRecord(record);
        csvWriter.NextRecord();

        // Flush after every record so memoryStream.Length is accurate.
        csvWriter.Flush();
        streamWriter.Flush();

        if (memoryStream.Length > (MAX_BYTES - headerSize))
        {
          // Current part is full: snapshot it and start a new one.
          parts.Add(memoryStream.ToArray());

          memoryStream.SetLength(0);
          memoryStream.Position = 0;

          csvWriter.WriteHeader<InventoryCsvItem>();
          csvWriter.NextRecord();
        }
      }

      // Don't forget the final, partially filled part.
      if (memoryStream.Length > headerSize)
      {
        parts.Add(memoryStream.ToArray());
      }
    }
  }

  for (int i = 0; i < parts.Count; i++)
  {
    var part = parts[i];

    File.WriteAllBytes($"C:/Temp/Part {i + 1} of {parts.Count}.csv", part);
  }
}
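
For example, adapting the loop above to a DataTable could look roughly like the sketch below. `WriteField` and `NextRecord` are standard CsvWriter calls; the two helper methods and their names are my own. They would replace the `WriteHeader<InventoryCsvItem>()` and `WriteRecord(record)` calls.

void WriteDataTableHeader(CsvWriter csvWriter, DataTable table)
{
  // One header field per column, then terminate the record.
  foreach (DataColumn column in table.Columns)
    csvWriter.WriteField(column.ColumnName);
  csvWriter.NextRecord();
}

void WriteDataTableRow(CsvWriter csvWriter, DataRow row)
{
  // One field per cell; CsvHelper takes care of quoting where needed.
  foreach (var item in row.ItemArray)
    csvWriter.WriteField(item);
  csvWriter.NextRecord();
}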
Greg R Taylor