
The hardware, infrastructure, and redundancy are not in the scope of this question.

I am building an SSIS ETL solution that needs to import ~600,000 small, simple files per hour. With my current design, SQL Agent runs the SSIS package, which takes "n" files and processes them.

  • Number of files per batch “n” is configurable
  • The schedule on which SQL Agent runs the SSIS package is configurable

I wonder whether the above approach is the right choice? Or, alternatively, should I have an infinite loop in the SSIS package that keeps taking and processing files?

So the question boils down to a choice between an infinite loop and batch + schedule. Is there any other, better option?

Thank you

Allan Xu
  • Never go with the infinite loop, at least within SSIS. It's not really designed for that. I've been told, but never seen documented, that there are opportunities for ...not memory leaks with the infinite approach, but whatever it is when garbage collection never runs (until the process is closed) – billinkc May 17 '16 at 15:21
  • @billinkc, do you think SSIS is the right tool for the task I explained above? – Allan Xu May 17 '16 at 21:59
  • I would consider a different tool, for example PowerShell. If the task is simply to load files, with no transformation, then you have access to the whole .NET Framework without the overhead of SSIS. – Dariusz Bielak May 18 '16 at 07:00

3 Answers


In a similar situation, I run an agent job every minute and process all files present. If the job takes 5 minutes to run because there are a lot of files, the agent skips the scheduled runs until the first one finishes, so there is no worry that two processes will conflict with each other.

Joe C
  • Your Agent scheduling approach helps me. Thank you. Inside your package, how many files were you taking each time? Were you taking one file at a time or did you implement any batch mechanism? – Allan Xu May 17 '16 at 16:49
  • In one solution I use the Foreach File enumerator and let it run through all the files. I am not sure if the enumerator checks for files on each pass through the loop or takes a snapshot on loading. In another solution that uses a third-party application to process the files, I use a C# script with DirectoryInfo / FileInfo objects. I know these objects take a snapshot when instantiated. So new files added while the process is running get picked up on the next run. – Joe C May 17 '16 at 17:01
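The snapshot-per-run behavior Joe C describes can be sketched in a few lines. This is a minimal illustration, not the actual SSIS/C# solution; the folder arguments are placeholders. The key point is that the directory listing is taken once per run, so files that land mid-run wait for the next scheduled run:

```python
import os
import shutil

def process_batch(inbox, processed):
    """Snapshot the inbox once, then work that fixed list.

    Files added to `inbox` after the listdir() call are not in the
    snapshot, so they get picked up on the next scheduled run.
    Returns the number of files processed."""
    snapshot = sorted(os.listdir(inbox))  # snapshot taken here, once
    for name in snapshot:
        src = os.path.join(inbox, name)
        # ...actual import/load work for the file would go here...
        shutil.move(src, os.path.join(processed, name))
    return len(snapshot)
```

Moving each file out of the inbox as it is processed also means a crashed run leaves unprocessed files in place for the next run.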

Is SSIS the right tool?

Maybe. Let's start with the numbers:

600,000 files / 60 minutes = 10,000 files per minute

600,000 files / (60 minutes * 60 seconds) ≈ 167 files per second

Regardless of what technology you use, you're looking at some extremes here. Windows NTFS starts to choke around 10k files in a folder, so you'll need to employ some folder strategy to keep that count down, in addition to regular maintenance.
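One common folder strategy is to shard incoming files across a fixed set of subfolders by hashing the file name. This is a hypothetical sketch (the shard count and paths are illustrative, not from the question); at 600k files/hour even 64 shards only buys you ~9.4k files per shard per hour, so you still need to sweep processed files out promptly:

```python
import hashlib
import os

def shard_path(root, filename, shards=64):
    """Map a file name deterministically to one of `shards`
    subfolders, so no single directory accumulates all the files.
    Returns e.g. root/07/filename."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % shards
    return os.path.join(root, "%02d" % bucket, filename)
```

Because the mapping is deterministic, any worker can recompute where a given file lives without a lookup table.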

In 2008, the SSIS team managed to load 1 TB in 30 minutes, all sourced from disk, so SSIS can perform very well. It can also perform really poorly, which is how I've managed to gain ~36k SO Unicorn points.

Six years is a lifetime in the world of computing, so you may not need to take such drastic measures as the SSIS team did to set their benchmark, but you will need to look at their approach. I know you've stated the hardware is outside the scope of discussion, but it is very much relevant: if the file system (SAN, NAS, local disk, flash, or whatever) can't serve 600k files per hour, then you'll never be able to clear your work queue.

Your goal is to get as many workers as possible engaged in processing these files. The Work Pile pattern can be pretty effective to this end. Basically, each process asks: is there work to be done? If so, it takes a piece and works on it. You then scale up the number of workers asking for and doing work. The challenge here is to ensure you have some mechanism to prevent workers from processing the same file. Maybe that's as simple as partitioning by directory or file name, or some other mechanism that is right for your situation.
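One simple claim mechanism, sketched below under the assumption that all folders live on the same volume (names are illustrative): a worker "takes a piece" by renaming a file into its own private folder. The rename is atomic, so when two workers race for the same file, one succeeds and the other gets an error and moves on:

```python
import os

def claim_next(inbox, claim_dir):
    """Work-pile claim step: atomically move one file from the shared
    inbox into this worker's private claim folder.

    If another worker renamed the file first, os.rename raises an
    OSError and we simply try the next candidate. Returns the claimed
    path, or None when the pile is empty."""
    for name in os.listdir(inbox):
        src = os.path.join(inbox, name)
        dst = os.path.join(claim_dir, name)
        try:
            os.rename(src, dst)  # atomic on the same volume
            return dst           # this worker now owns the file
        except OSError:
            continue             # lost the race; try the next file
    return None                  # no work left
```

Each worker then processes only the files in its own claim folder, so no file is handled twice.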

I think you're headed down this approach based on your problem definition with the agent jobs that handle N files, but I wanted to give your pattern a name for further research.

I would agree with Joe C's answer - schedule the SQL Agent job to run as frequently as needed. If it's already running, it won't spawn a second process. Perhaps you'll have multiple agents that all start every minute - AgentFolderA, AgentFolderB, ... AgentFolderZZH - each launching a master package that then has subprocesses looking for work.

billinkc

Use a WMI Event Watcher task to detect whether a new file has arrived; as the next step, you can either call the job scheduler or execute the SSIS package directly.

More details on the WMI Event Watcher task:

https://msdn.microsoft.com/en-us/library/ms141130%28v=sql.105%29.aspx
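The WMI Event Watcher task is driven by a WQL query. A query along these lines (the drive and path here are illustrative, not from the question) fires when a file is created in the watched folder:

```sql
SELECT * FROM __InstanceCreationEvent WITHIN 5
WHERE TargetInstance ISA 'CIM_DataFile'
  AND TargetInstance.Drive = 'C:'
  AND TargetInstance.Path = '\\etl\\inbox\\'
```

Note that at ~167 files per second, a per-file event watcher may generate far more events than you want to handle individually; it is more commonly used to wake a batch process when the first file of a new batch lands.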

sandeep rawat