I'm considering using Data Lake technologies, which I have been studying for the last few weeks, as an alternative to the traditional SSIS ETL scenarios I have been working with for many years.
I think of Data Lake as something closely tied to big data, but where is the line between using Data Lake technologies and SSIS?
Is there any advantage to using Data Lake technologies with 25 MB, ~100 MB, or ~300 MB files? Parallelism? Flexibility? Future extensibility? Is there any performance gain when the files to be loaded are not as big as U-SQL's ideal scenario?
What are your thoughts? Would it be like using a hammer to crack a nut? Please don't hesitate to ask me any questions to clarify the situation. Thanks in advance!
EDIT 21/03: More clarifications:
- It has to be in the cloud.
- The reason I considered ADL is that there is no substitute for SSIS in the cloud. There is ADF, but it's not the same: it orchestrates data movement, but it's not as flexible as SSIS.
- I thought I could use U-SQL for some (basic) transformations (see the sketch after this list), but I see several problems:
- There are many basic things I cannot do: loops, updates, writing logs to a SQL database...
- The output can only be a U-SQL table or a file. The architecture doesn't look good this way: even though U-SQL is very good with big files, I need an extra step to export the output to another database or data warehouse. Or maybe that's just how it's done in big data warehouses... I don't know.
- In my tests, it takes 40 s for a 1 MB file and 1 min 15 s for a 500 MB file. I cannot justify a 40-second process for 1 MB (plus loading into the database/data warehouse with ADF).
- The code looks disorganised to a user, since scripts with many basic validations end up as very long U-SQL scripts.
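For reference, this is a minimal sketch of the kind of basic transformation I mean. The path, schema, and filter are just placeholders from my tests, not real names:

```
// Minimal U-SQL sketch: extract a CSV, apply a basic validation/filter,
// and write the result back out. Paths and column names are placeholders.
@input =
    EXTRACT CustomerId int,
            Amount decimal,
            OrderDate DateTime
    FROM "/input/orders.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// The only "transformation" step is a set-based SELECT:
// no loops, no UPDATEs.
@valid =
    SELECT CustomerId,
           Amount,
           OrderDate
    FROM @input
    WHERE Amount > 0;

// Output can only go to a file (or a U-SQL table), so loading a SQL
// database/DWH needs an extra step afterwards (e.g. an ADF copy).
OUTPUT @valid
    TO "/output/valid-orders.csv"
    USING Outputters.Csv(outputHeader: true);
```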
Don't get me wrong, I really like the ADL technologies, but I think that, for now, they are for something very specific, and there is still no substitute for SSIS in the cloud. What do you think? Am I wrong?