
I am migrating long-running ETL processes from on-premises to Azure. The slowest one takes up to 2 hours to complete (it reads CSV files over FTP and writes the data to an Azure SQL Database). I have read lots of articles on the web about the pros and cons of both technologies, but it would be great to get some feedback from the community. I would prefer to use Azure Functions because "it's new", but it seems that Azure Functions are not designed to support long-running processes.

Thanks for your feedback! Regards

RudyCo
  • Have you looked at SSIS? I would not be surprised if SSIS was able to knock that 2 hours down to minutes. –  May 05 '17 at 16:19
  • Thanks, but SSIS is not an option, because there is no SSIS PaaS offering on Azure, and my ETL process is purely C# code (as fast as SSIS). – RudyCo May 05 '17 at 16:35
  • Azure Functions can't run for more than 5 minutes on the consumption plan, so you'll deal with reserved instances in both cases. – Mikhail Shilkov May 05 '17 at 16:35
  • Thanks. I knew that. I can use a standard plan, not an issue. – RudyCo May 05 '17 at 16:38
  • The ETL tool in the Azure cloud is Azure Data Factory, did you take a look at that? – Peter Bons May 05 '17 at 16:56
  • You can have long-running Azure Functions on an App Service plan if you turn on the Always On feature. Feel free to try it out and see if the pipeline will work for your use case. – Ling Toh May 05 '17 at 18:48
  • Have you seen... https://learn.microsoft.com/en-us/sql/integration-services/azure-feature-pack-for-integration-services-ssis... and one of the things I was able to take advantage of in SSIS was the ability to run C# modules on the data stream. Just trying to throw out some things that, while not answers, might help... Sorry if it is not helpful. –  May 08 '17 at 15:38

1 Answer


Assuming that you don't read one file for 2 hours, but instead read many files, each of which can be processed in under 5 minutes, you can use Azure Functions on the consumption plan. So the first thing you should consider is whether that is true, or whether you can slice your FTP transfers into more, smaller pieces.
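If that slicing works for you, a minimal sketch of such a consumption-plan function could look like this, with one queue message per CSV file (Functions v1 / .NET Framework style; the queue name, app-setting names, column schema and table name are all placeholders, and the naive comma split is not a real CSV parser):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Net;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class ProcessCsvFile
{
    // One queue message = one CSV file, so each invocation stays well
    // under the consumption-plan timeout as long as files are small enough.
    [FunctionName("ProcessCsvFile")]
    public static void Run(
        [QueueTrigger("csv-files")] string fileName,   // queue name is an assumption
        TraceWriter log)
    {
        // Download a single file from the FTP server.
        var ftpUri = Environment.GetEnvironmentVariable("FtpBaseUri") + fileName;
        var request = (FtpWebRequest)WebRequest.Create(ftpUri);
        request.Method = WebRequestMethods.Ftp.DownloadFile;

        var table = new DataTable();
        table.Columns.Add("Col1", typeof(string));     // adjust to your schema
        table.Columns.Add("Col2", typeof(string));

        using (var response = (FtpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var fields = line.Split(',');          // illustration only
                table.Rows.Add(fields[0], fields[1]);
            }
        }

        // Bulk-load the parsed rows into Azure SQL Database.
        var connectionString = Environment.GetEnvironmentVariable("SqlConnectionString");
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.TargetTable"; // placeholder
                bulkCopy.WriteToServer(table);
            }
        }

        log.Info($"Loaded {table.Rows.Count} rows from {fileName}");
    }
}
```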

Azure Functions use the same SDK as WebJobs under the hood, but you get the benefit of faster time to running code, and you get to worry less about managing them. That is good for the general case, but if you want more control, WebJobs give you that. At the other end of the spectrum, for full control, you can use Azure VMs. This answer gives you a nice overview of Functions vs WebJobs.

One idea: if you are willing to move your CSV files, you could use Azure Data Lake in combination with U-SQL.

I am not really sure how your current pipeline receives data, but you can set up a pipeline that uses Azure Functions to store all your data in Data Lake. Since you store the files that you receive over FTP, you don't need a long-running Azure Function. You could run an Azure Function on the consumption plan that stores only a small amount of data in Data Lake per run, so that it doesn't time out.
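A rough sketch of such an ingestion function, assuming the Microsoft.Azure.DataLake.Store client SDK (the tenant/client IDs, account name, queue name and paths are placeholders, and GetFtpStream stands in for your existing FTP code):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.DataLake.Store;
using Microsoft.Azure.WebJobs;
using Microsoft.Rest.Azure.Authentication;

public static class StoreInDataLake
{
    // Copies one incoming file into Data Lake Store per invocation,
    // so each run finishes quickly and never hits the timeout.
    [FunctionName("StoreInDataLake")]
    public static async Task Run(
        [QueueTrigger("incoming-files")] string fileName)
    {
        var creds = await ApplicationTokenProvider.LoginSilentAsync(
            "<tenant-id>", "<client-id>", "<client-secret>");
        var adls = AdlsClient.CreateClient("<account>.azuredatalakestore.net", creds);

        using (Stream source = GetFtpStream(fileName))
        using (Stream target = adls.CreateFile("/raw/" + fileName, IfExists.Overwrite))
        {
            await source.CopyToAsync(target);
        }
    }

    private static Stream GetFtpStream(string fileName)
    {
        // e.g. FtpWebRequest, as in the earlier sketch
        throw new NotImplementedException();
    }
}
```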

You can then prepare the data using the analytical capabilities of U-SQL and the other Data Lake Analytics services. With UDOs, or with Azure Data Factory, you can load the data into your Azure SQL Database. A good thing about U-SQL is that you pay for compute only while you use it, as with Azure Functions, so the whole pipeline can run on "serverless" computing.
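For illustration, one way to kick off such a U-SQL preparation job from C# is the Data Lake Analytics management SDK (Microsoft.Azure.Management.DataLake.Analytics). This is a sketch only; all IDs, account names, paths and the trivial script are placeholder assumptions:

```csharp
using System;
using Microsoft.Azure.Management.DataLake.Analytics;
using Microsoft.Azure.Management.DataLake.Analytics.Models;
using Microsoft.Rest.Azure.Authentication;

public static class SubmitUSqlJob
{
    public static void Main()
    {
        var creds = ApplicationTokenProvider.LoginSilentAsync(
            "<tenant-id>", "<client-id>", "<client-secret>").Result;
        var jobClient = new DataLakeAnalyticsJobManagementClient(creds);

        // A trivial U-SQL script: read a raw CSV and write a prepared file
        // that a Data Factory copy activity (or a U-SQL UDO) can then load
        // into Azure SQL Database.
        const string script = @"
            @rows =
                EXTRACT Col1 string, Col2 string
                FROM ""/raw/sample.csv""
                USING Extractors.Csv();

            OUTPUT @rows
            TO ""/prepared/output.csv""
            USING Outputters.Csv();";

        var jobId = Guid.NewGuid();
        var job = new JobInformation(
            "PrepareCsv", JobType.USql, new USqlJobProperties(script),
            degreeOfParallelism: 1);
        jobClient.Job.Create("<adla-account>", jobId, job);
    }
}
```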

Vukasin