
Update: Here's a similar question


Suppose I have a DataTable with a few thousand DataRows in it.

I'd like to break up the table into chunks of smaller rows for processing.

I thought C#3's improved ability to work with data might help.

This is the skeleton I have so far:

DataTable Table = GetTonsOfData();

// Chunks should be any IEnumerable<Chunk> type
var Chunks = ChunkifyTableIntoSmallerChunksSomehow; // ** help here! **

foreach(var Chunk in Chunks)
{
   // Chunk should be any IEnumerable<DataRow> type
   ProcessChunk(Chunk);
}

Any suggestions on what should replace ChunkifyTableIntoSmallerChunksSomehow?

I'm really interested in how someone would do this with C#3 tools. If attempting to apply these tools is inappropriate, please explain!


Update 3 (revised chunking as I really want tables, not IEnumerables; going with an extension method--thanks Jacob):

Final implementation:

Extension method to handle the chunking:

public static class HarenExtensions
{
    public static IEnumerable<DataTable> Chunkify(this DataTable table, int chunkSize)
    {
        for (int i = 0; i < table.Rows.Count; i += chunkSize)
        {
            // Clone() copies the table's schema (columns) but none of its rows.
            DataTable Chunk = table.Clone();

            // Select() materializes every row as an array on each pass;
            // acceptable for a few thousand rows, wasteful for very large tables.
            foreach (DataRow Row in table.Select().Skip(i).Take(chunkSize))
            {
                Chunk.ImportRow(Row);
            }

            yield return Chunk;
        }
    }
}

Example consumer of that extension method, with sample output from an ad hoc test:

class Program
{
    static void Main(string[] args)
    {
        DataTable Table = GetTonsOfData();

        foreach (DataTable Chunk in Table.Chunkify(100))
        {
            Console.WriteLine("{0} - {1}", Chunk.Rows[0][0], Chunk.Rows[Chunk.Rows.Count - 1][0]);
        }

        Console.ReadLine();
    }

    static DataTable GetTonsOfData()
    {
        DataTable Table = new DataTable();
        Table.Columns.Add(new DataColumn());

        for (int i = 0; i < 1000; i++)
        {
            DataRow Row = Table.NewRow();
            Row[0] = i;

            Table.Rows.Add(Row);
        }

        return Table;
    }
}
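
With the 1,000-row table from GetTonsOfData and a chunk size of 100, the test above prints the first and last value of each chunk:

0 - 99
100 - 199
200 - 299
300 - 399
400 - 499
500 - 599
600 - 699
700 - 799
800 - 899
900 - 999
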
Michael Haren
  • I'm not using this for paging but I just realized the parallel application. If you find any good duplicates that apply here please let me know and I'll close the question. – Michael Haren Apr 20 '09 at 17:54
  • You could inline it by creating an extension method that does the above. Then you could use var Chunks = from chunk in table.Chunkify select chunk; – Jacob Proffitt Apr 20 '09 at 18:17

6 Answers


This is quite readable and only iterates through the sequence once, perhaps saving you the rather bad performance characteristics of repeated redundant Skip() / Take() calls:

public IEnumerable<IEnumerable<DataRow>> Chunkify(DataTable table, int size)
{
    List<DataRow> chunk = new List<DataRow>(size);

    // DataRowCollection is non-generic, so type the loop variable explicitly.
    foreach (DataRow row in table.Rows)
    {
        chunk.Add(row);
        if (chunk.Count == size)
        {
            yield return chunk;
            chunk = new List<DataRow>(size);
        }
    }

    // Emit the final, possibly smaller, chunk.
    if (chunk.Count > 0) yield return chunk;
}
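
A minimal consumer, reusing the Table and ProcessChunk from the question (ProcessChunk accepting an IEnumerable<DataRow>):

DataTable Table = GetTonsOfData();

foreach (var chunk in Chunkify(Table, 100))
{
    // Each chunk is a List<DataRow> holding at most 100 rows;
    // only the final chunk can be smaller.
    ProcessChunk(chunk);
}
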
mqp
  • This is certainly the most readable approach, but it forces the creation of an in-memory list for each chunk. That's reasonable for small chunk sizes but not for large ones. Also, I'd clear the chunk rather than creating a brand new one each time. I think the best solution would probably create an iterator which has a sub iterator, but that's certainly not going to be easy to read. – Damian Powell Sep 26 '09 at 16:22

This seems like an ideal use-case for Linq's Skip and Take methods, depending on what you want to achieve with the chunking. This is completely untested code, never entered in an IDE, but your method might look something like this.

private List<List<DataRow>> ChunkifyTable(DataTable table, int chunkSize)
{
    List<List<DataRow>> chunks = new List<List<DataRow>>();
    for (int i = 0; i < table.Rows.Count / chunkSize; i++)
    {
        chunks.Add(table.AsEnumerable().Skip(i * chunkSize).Take(chunkSize).ToList());
    }
    
    return chunks;
}
Jacob Proffitt
  • @Jacob: thanks for the suggestion. I made this return tables and implement IEnumerable with the yield keyword. I also turned it into an extension method as you suggested. Works great! – Michael Haren Apr 20 '09 at 18:36
  • This is broken for datasets not evenly divisible by `chunkSize`. 1/10 = 0 when using ints, so the last part of the table is cut off. – N_A Feb 20 '15 at 03:15
  • Yeah, you have a point. At least, when rowcount is less than chunksize it'll fail on the first test (as i will be zero and not < zero so the for loop will fail to enter). i increments after evaluation, though, so 11/10 will run twice and pick up all 11. – Jacob Proffitt Feb 20 '15 at 18:27
  • Here is a C# and VB.NET version of the code without using LINQ, which worked for me: [split DataTable into chunks](http://stackoverflow.com/questions/34663933/split-datatable-into-multiple-fixed-sized-tables/43397945#43397945) – Jeff D Apr 13 '17 at 17:04

Here's an approach that might work:

public static class Extensions
{
    public static IEnumerable<IEnumerable<T>> InPages<T>(this IEnumerable<T> enumOfT, int pageSize)
    {
        if (null == enumOfT) throw new ArgumentNullException("enumOfT");
        if (pageSize < 1) throw new ArgumentOutOfRangeException("pageSize");
        var enumerator = enumOfT.GetEnumerator();
        // All pages share this single enumerator, so each page must be
        // consumed in order before the next page is requested.
        while (enumerator.MoveNext())
        {
            yield return InPagesInternal(enumerator, pageSize);
        }
    }
    private static IEnumerable<T> InPagesInternal<T>(IEnumerator<T> enumeratorOfT, int pageSize)
    {
        var count = 0;
        while (true)
        {
            yield return enumeratorOfT.Current;
            if (++count >= pageSize) yield break;
            if (false == enumeratorOfT.MoveNext()) yield break;
        }
    }
    public static string Join<T>(this IEnumerable<T> enumOfT, object separator)
    {
        var sb = new StringBuilder();
        if (enumOfT.Any())
        {
            sb.Append(enumOfT.First());
            foreach (var item in enumOfT.Skip(1))
            {
                sb.Append(separator).Append(item);
            }
        }
        return sb.ToString();
    }
}
[TestFixture]
public class Tests
{
    [Test]
    public void Test()
    {
        // Arrange
        var ints = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        var expected = new[]
        {
            new[] { 1, 2, 3 },
            new[] { 4, 5, 6 },
            new[] { 7, 8, 9 },
            new[] { 10      },
        };

        // Act
        var pages = ints.InPages(3);

        // Assert
        var expectedString = (from x in expected select x.Join(",")).Join(" ; ");
        var pagesString = (from x in pages select x.Join(",")).Join(" ; ");

        Console.WriteLine("Expected : " + expectedString);
        Console.WriteLine("Pages    : " + pagesString);

        Assert.That(pagesString, Is.EqualTo(expectedString));
    }
}
Damian Powell

Jacob wrote

This seems like an ideal use-case for Linq's Skip and Take methods, depending on what you want to achieve with the chunking. This is completely untested code, never entered in an IDE, but your method might look something like this.

private List<List<DataRow>> ChunkifyTable(DataTable table, int chunkSize)
{
    List<List<DataRow>> chunks = new List<List<DataRow>>();
    for (int i = 0; i < table.Rows.Count / chunkSize; i++)
    {
        chunks.Add(table.AsEnumerable().Skip(i * chunkSize).Take(chunkSize).ToList());
    }

    return chunks;
}

Thanks for this Jacob - useful for me, but I think the test in your example should be <= not <. If you use < and the number of rows is less than chunkSize, the loop is never entered. Similarly, the last partial chunk is not captured, only full chunks. As you've stated, the example is untested, so this is just an FYI in case someone else uses your code verbatim ;-)
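
For reference, a sketch of the same method with the remainder handled via ceiling division (untested, and assuming a reference to System.Data.DataSetExtensions for AsEnumerable()):

private List<List<DataRow>> ChunkifyTable(DataTable table, int chunkSize)
{
    List<List<DataRow>> chunks = new List<List<DataRow>>();

    // Ceiling division: rounds up, so a final partial chunk (or a table
    // smaller than chunkSize) still produces a chunk.
    int chunkCount = (table.Rows.Count + chunkSize - 1) / chunkSize;

    for (int i = 0; i < chunkCount; i++)
    {
        chunks.Add(table.AsEnumerable().Skip(i * chunkSize).Take(chunkSize).ToList());
    }

    return chunks;
}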

David Clarke

Here is a completely different approach. No memory is allocated for the chunks; the rows are yielded lazily, straight from the table.

public static IEnumerable<IEnumerable<DataRow>> Chunkify(
    this DataTable dataTable, int chunkSize)
{
    for (int i = 0; i < dataTable.Rows.Count; i += chunkSize)
    {
        yield return GetChunk(i, Math.Min(i + chunkSize, dataTable.Rows.Count));
    }
    IEnumerable<DataRow> GetChunk(int from, int toExclusive)
    {
        for (int j = from; j < toExclusive; j++)
        {
            yield return dataTable.Rows[j];
        }
    }
}

Usage example:

var dataTable = GetTonsOfData();
foreach (var chunk in dataTable.Chunkify(1000))
{
    Console.WriteLine($"Processing chunk of {chunk.Count()} rows");
    foreach (var dataRow in chunk)
    {
        Console.WriteLine(dataRow[0]);
    }
}
Theodor Zoulias

.NET (Core) 6 introduced the Chunk extension method that can be used to easily split a DataTable into batches:

IEnumerable<DataRow[]> chunks = myTable.AsEnumerable()
                                       .Chunk(1000);

In earlier versions MoreLINQ's Batch extension method can be used to do the same:

IEnumerable<IEnumerable<DataRow>> chunks = myTable.AsEnumerable()
                                                  .Batch(1000);

Both can also be used to split a DataTable into smaller tables. The following extension method does this, using a LoadRows helper to extract the row-loading code:

public static IEnumerable<DataTable> Chunk(this DataTable source, int size)
{
    ArgumentNullException.ThrowIfNull(source);
    foreach (var chunk in source.AsEnumerable().Chunk(size))
    {
        var chunkTable = source.Clone();
        chunkTable.MinimumCapacity = size;
        chunkTable.LoadRows(chunk);
        yield return chunkTable;
    }
}

public static DataTable LoadRows(this DataTable table, IEnumerable<DataRow> rows)
{
    ArgumentNullException.ThrowIfNull(table);
    ArgumentNullException.ThrowIfNull(rows);
    
    foreach (var row in rows)
    {
        table.ImportRow(row);
    }

    return table;
}

ArgumentNullException.ThrowIfNull(source); is another .NET Core addition that throws an ArgumentNullException using the parameter name if the parameter is null.
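
On frameworks that predate it, the same guard is written by hand:

if (source is null) throw new ArgumentNullException(nameof(source));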

Finally, chunkTable.MinimumCapacity = size; is used to reserve space for each chunk table's rows up front, to avoid reallocations.
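
A possible consumer, assuming myTable is the table from the snippets above:

foreach (DataTable chunkTable in myTable.Chunk(1000))
{
    // Each chunkTable shares the source schema and holds at most 1000 rows.
    Console.WriteLine($"Loaded a chunk with {chunkTable.Rows.Count} rows");
}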

Panagiotis Kanavos