
I have built an SSIS package to load data from CSV files into a database table. The CSV files are first downloaded from Azure Blob Storage using a PowerShell script, and then each of these files is loaded into a target table in SQL Server.

So I set up a ForEach enumerator to loop through all the files and load the data into the target table, but the process is too slow. Each file has just one row of data (around 30 columns), so to load, say, 20k rows I have to loop through 20k files, and the package takes HOURS to run.

I tried looking for alternatives for loading data from multiple files but couldn't find any concrete solution. Hilmar Buchta has an interesting suggestion to use a script task to improve performance, but I don't have any C# know-how whatsoever.

Has anyone run into a similar problem and overcome it? If anyone has a sample that loads multiple files using a script task, it would help a lot.

Any help is appreciated.

A.B
  • Consider that having those files as they are is not smart. 20k files are a LOT of IO operations - whatever you do, that is going to be a lot slower than loading 200 files with a lot of lines in each file. – TomTom Feb 22 '16 at 12:10
  • Just the fact that you have a lot of very small files is a big problem. Not for SSIS, for any tool, including Hadoop. Either change your script to download a single file, or concatenate the small files into one big file. You don't explain what your package does, but if you run a full process inside the iterator you are also wasting a lot of time. Use the iterator to load all data into a single staging table, then process the staged data in bulk. – Panagiotis Kanavos Feb 22 '16 at 12:14
  • Also consider that these small files cause a serious waste of space, since each of them will use a full disk page (4KB). – Panagiotis Kanavos Feb 22 '16 at 12:15
  • I once had a similar problem and found out that the two major performance losses were (1) opening all these files and (2) that a ***new connection*** is created each time you import one of these files. The first of the two cannot be avoided, but the time can be reduced by first making sure that all these files are locally available and not on a network drive (or some other remote location). For the second part I concatenated (in memory) a bulk of approx. 500 files before sending it to the server for import. So, 1 connection every 500 files. That helped a lot too. – Ralph Feb 22 '16 at 12:16
  • What about the MULTIFLATFILE Connection Manager mentioned in the same article by Hilmar Buchta? You do have .csv files, so it should work for you. – Y.B. Feb 22 '16 at 12:22
  • @PanagiotisKanavos the package doesn't do much - just one derived column and one conditional split are used, so I guess that rules out execution time wasted inside the iterator. I know it's a ridiculous number of files to load, but that's what the requirement is. As my team doesn't have any control over how the data is stored (i.e. I can't concatenate data at the source itself), I believe I will get the data as is: one row per file. I don't know what prompted such a move, but I guess the other team must have their own reasons for storing data like that. – A.B Feb 22 '16 at 14:06
  • You *can* concatenate the files once you copy them locally. Otherwise you are wasting time opening/closing connections. – Panagiotis Kanavos Feb 22 '16 at 15:10
  • @PanagiotisKanavos you mean I should first create a package to concatenate all the input files into a single one and then run the load, right? – A.B Feb 22 '16 at 16:04
  • @Ralph can you elaborate a little on how you concatenated 500 files in memory before sending them to the server? – A.B Feb 22 '16 at 16:11
  • In our case we didn't want to use SSIS because we needed/wanted a more "controlled" import with immediate feedback to the user (for possible interventions). So, we used Excel with VBA as shown here: http://stackoverflow.com/questions/26743347/faster-way-to-import-excel-spreadsheet-to-array-with-ado/30346959#30346959 Note that the T-SQL command passed to the server contains several inserts at once. We got the best performance with strings between 500 KB and 1 MB (per connection), uploading approx. 10 MB per minute (merely using Excel and VBA). – Ralph Feb 22 '16 at 16:22
  • @A.B, no, I meant add a single step to concatenate all files. A simple `copy *.csv > allfiles.txt` will do. An enumerator that appends each file to a single target file can do the same, but will take more time – Panagiotis Kanavos Feb 23 '16 at 09:11
  • @PanagiotisKanavos I already tried that using cmd, but somehow all the files get appended with a ";" instead of a LF or newline char at the end. So when I import the single output file in SSIS I see a thousand columns instead of the desired 30. Can't find any resource on how to add a new line before copying the next csv :( – A.B Feb 23 '16 at 09:53
  • @Y.B. I don't know if MULTIFLATFILE traverses subfolders - any idea? – A.B Feb 23 '16 at 09:55
  • As far as I can see it does not, but if the subfolder names are known they can all be listed, separated by |: [Multiple Flat Files Connection Manager](https://msdn.microsoft.com/en-gb/library/ms137830.aspx) (see the connection string sketch after these comments). – Y.B. Feb 23 '16 at 10:11
  • @A.B copy doesn't generate semicolons out of the blue - in fact, it doesn't use any kind of separator. The semicolon is there because it was in the data or the command line. Perhaps the source files *don't* contain any newlines at all? – Panagiotis Kanavos Feb 23 '16 at 10:44
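
For reference, here is a minimal sketch of the MULTIFLATFILE option raised in the comments above: the Multiple Flat Files connection manager accepts several paths, separated by the pipe character and with wildcards allowed, in its ConnectionString property. The folder names below are placeholders, not the actual ones from the question.

# Hypothetical paths - the Multiple Flat Files connection manager's ConnectionString
# is a pipe-separated list of file paths, and wildcards are allowed.
$multiFlatFileConnectionString = "C:\Import\Folder1\*.csv|C:\Import\Folder2\*.csv"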

2 Answers


To conclude the conversation in the comments, here is a script (Merging multiple CSV files into one using PowerShell) to load all the data in one go (assuming all the files are of the same format), with a tiny tweak to traverse subfolders and append a carriage return to the end of each file:

# Start with a clean output file
if (Test-Path "COMBINED_FILE.csv") { Remove-Item "COMBINED_FILE.csv" }

$getFirstLine = $true

Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | ForEach-Object {
    $filePath = $_.FullName

    $lines = Get-Content $filePath

    # Keep the header line from the first file only, skip it for all the others.
    # The $( ) subexpression lets us append an explicit newline after each file's lines.
    $linesToWrite = $(switch ($getFirstLine) {
        $true  { $lines }
        $false { $lines | Select-Object -Skip 1 }
    }) + [System.Environment]::NewLine

    $getFirstLine = $false
    Add-Content "COMBINED_FILE.csv" $linesToWrite
}
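
The script could then be run before the data flow, for example from an Execute Process Task; a possible invocation (the script path is only an assumption) would be:

# Hypothetical script path; -NoProfile and -ExecutionPolicy Bypass keep the call
# independent of the profile and execution policy of the account running the package.
powershell.exe -NoProfile -ExecutionPolicy Bypass -File "C:\Scripts\Merge-CsvFiles.ps1"

The data flow can then read COMBINED_FILE.csv through a single Flat File connection instead of opening thousands of tiny files.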
Y.B.

Kind of feel a little outwitted here. I deployed my package to Integration Services and scheduled a run for it via SQL Agent.

Guess what!! A package that took 12 hours to load 6k files now loads 20k files in under 30 minutes. Never would I have thought that executing a package in SSDT and executing it on the server would give such contrasting results.

Now I am not sure what the exact reason for this could be, but I guess the time wasted logging all the execution results in SSDT could have made a big difference.

I will keep looking for the exact reason for this behaviour, but the significant decrease in execution time is acceptable to me, as I don't have a large number of files to load data from every day.
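
For reference, a deployed package can also be run from the command line with dtexec, which is roughly what the SQL Agent job step does; the catalog path and server name below are placeholders, not the real ones from this project:

# Placeholder catalog path and server name; /ISSERVER runs a package that has been
# deployed to the SSIS catalog (SSISDB) on the given server.
& dtexec.exe /ISSERVER "\SSISDB\MyFolder\MyProject\LoadCsvFiles.dtsx" /SERVER "MyDbServer"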

I would have gone for the file-concatenation option had I not needed the original files, as we have now added a mail task to send files with errors (truncation/data) back to the dev team.

Thanks for the help though @everyone.

A.B