
I'm looking for an example of a simple data flow in C# without relying on SSIS or external libraries such as Rhino-ETL (which is a very good library, BTW).

Requirements:

  • One arbitrary ADO.NET data source
  • One arbitrary ADO.NET data destination
  • Does not require the entire dataset to be loaded into memory at once, so that it can handle arbitrarily large data sets. It would need to rely on a buffer of some sort, or on "yield return" as Rhino ETL does.
  • Uses bulk insert (i.e. SqlBulkCopy)
  • Minimal transformation. No lookups, no merge joins.
  • Multi-threading is not necessary if single-threaded can do the job.

Another way of stating the question ... how does Rhino ETL do this, but without all the abstractions and inherited classes, and without the quacking dictionary? I would like to see it in a simple non-abstract class.

And yet another rephrasing of the question: I'm looking for the fundamental example of taking the data flow output of a "select" query and bulk inserting it, 10,000 or 50,000 records at a time, into a destination without loading the entire result into memory, which could potentially exceed available RAM.
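Roughly what I'm picturing (a sketch only, not something I've proven out; it assumes SQL Server on the destination side, since SqlBulkCopy is SQL Server specific, and the connection strings and table names are placeholders):

```csharp
using System.Data.SqlClient;

class SimpleDataFlow
{
    // Stream rows from a source query straight into SqlBulkCopy.
    // WriteToServer(IDataReader) pulls rows from the open reader as it sends
    // them, so memory use is bounded by the batch size, not the result size.
    static void Copy(string sourceConnectionString, string destConnectionString)
    {
        using (var source = new SqlConnection(sourceConnectionString))
        using (var cmd = new SqlCommand("SELECT * FROM dbo.SourceTable", source))
        {
            source.Open();
            using (var reader = cmd.ExecuteReader())
            using (var bulk = new SqlBulkCopy(destConnectionString))
            {
                bulk.DestinationTableName = "dbo.DestTable";
                bulk.BatchSize = 10000;   // rows per batch sent to the server
                bulk.BulkCopyTimeout = 0; // no timeout for a long-running copy
                bulk.WriteToServer(reader);
            }
        }
    }
}
```

Is that the gist of it, or is there more to it than handing an open data reader to the bulk copy?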

Korey
  • I'm considering offering a bounty (although I can barely afford one) ... what about this question needs clarification? – Korey Jan 10 '12 at 19:40
  • May I ask why you can't use Rhino ETL if it fits your requirements? – alexn Jan 10 '12 at 20:05
  • Partially as an educational process ... I'm surprised someone doesn't have a blog about this sort of thing. Partially political, as getting a third-party library approved is next to impossible. I also want to get away from SSIS, which gets more unwieldy as the project grows; refactoring 100+ SSIS packages approaches unrealistic. – Korey Jan 13 '12 at 15:18
  • I should rephrase (and I missed my 5-minute edit window) ... I haven't found a blog about it. There may be one out there that I missed. – Korey Jan 13 '12 at 15:26

2 Answers


It looks like you want to learn how an ETL program works to increase your programming knowledge. Rhino ETL is an open-source project, so you can get the source here:

https://github.com/ayende/rhino-etl

and see exactly how they do it. There are also other open-source ETL packages, so you can see how they approach things differently. For example, the Talend source can be found at:

http://www.talend.com/resources/source-code.php

Of course, if you are trying to write your own code for commercial use, you will not want to look at the source code of others, so you will need to come up with your own process.

Hope this helps you!

William Salzman
  • I'm looking for the fundamental example of taking a data flow output of a "select" query, and bulk inserting it at 10,000 or 50,000 records at a time to a destination without loading the entire result into memory, which could potentially exceed available RAM. – Korey Jan 13 '12 at 15:33

Far from a complete answer I'm afraid.

You can "page" the results of an arbitrary select query within .Net using one or more of the techniques outlined here.

http://msdn.microsoft.com/en-us/library/ff650700.aspx

This should allow you to chunk up the data and avoid RAM concerns.
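For instance, here is a rough sketch of one such technique (keyset paging, with SqlBulkCopy handling the insert side). It assumes SQL Server, an indexed ascending Id column of type bigint, and placeholder table and column names:

```csharp
using System.Data;
using System.Data.SqlClient;

class ChunkedCopy
{
    // Keyset ("seek") paging: fetch one bounded chunk at a time, so only
    // chunkSize rows are ever held in memory. Assumes an indexed, ascending
    // Id column; table and column names are placeholders.
    static void CopyInChunks(string sourceConnectionString,
                             string destConnectionString,
                             int chunkSize)
    {
        long lastId = 0;
        while (true)
        {
            var chunk = new DataTable();
            using (var source = new SqlConnection(sourceConnectionString))
            using (var cmd = new SqlCommand(
                "SELECT TOP (@n) Id, Col1, Col2 FROM dbo.SourceTable " +
                "WHERE Id > @lastId ORDER BY Id", source))
            {
                cmd.Parameters.AddWithValue("@n", chunkSize);
                cmd.Parameters.AddWithValue("@lastId", lastId);
                source.Open();
                new SqlDataAdapter(cmd).Fill(chunk);
            }

            if (chunk.Rows.Count == 0)
                break;                      // no rows left to copy

            using (var bulk = new SqlBulkCopy(destConnectionString))
            {
                bulk.DestinationTableName = "dbo.DestTable";
                bulk.WriteToServer(chunk);  // one chunk per bulk insert
            }

            lastId = (long)chunk.Rows[chunk.Rows.Count - 1]["Id"];
        }
    }
}
```

Each round trip pulls at most chunkSize rows, so memory stays bounded no matter how large the source table is.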

Alternatively, if your existing SSIS packages are simple or similar enough, it could be worthwhile to look at generating the SSIS packages automatically from a template. For example, I'm maintaining 100+ packages that are automatically generated by a small C# application using the EzAPI API for SSIS.

Steve Homer
  • Web app paging isn't what I'm looking for. I do want to query the entire result in one SQL query, but handle the data as it arrives, not after it has all arrived. EzAPI does look like an interesting solution to managing hundreds of packages. I'll take a look at that. – Korey Jan 20 '12 at 19:53
  • When I said paging, I was thinking more of splitting the data into chunks that you request at will. You'd do this for populating a webpage, yes, but it's also useful for client apps to implement streaming and limit memory use. – Steve Homer Jan 27 '12 at 09:24