This is coming out of my two weeks of Azure experience. I want to split files based on size. For example, given a table with 200k rows, I would like to set a parameter that splits the table into multiple files with a limit of 100 MB per file (if that makes sense). It would produce N files depending on the table size, named something like:

my_file_1ofN.csv
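
For reference, the logic I prototyped in the Python POC was roughly this (a minimal sketch only; the source file name is made up, and it assumes the table has already been exported to a single CSV):

```python
import csv
import math
import os

MAX_BYTES = 100 * 1024 * 1024  # 100 MB target per output file

def split_csv(src_path: str) -> None:
    # Estimate N up front from the source size so the "1ofN" names are stable.
    total_bytes = os.path.getsize(src_path)
    n_files = max(1, math.ceil(total_bytes / MAX_BYTES))

    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    # Spread the rows evenly across the N output files.
    rows_per_file = math.ceil(len(rows) / n_files)
    for i in range(n_files):
        chunk = rows[i * rows_per_file : (i + 1) * rows_per_file]
        out_name = f"my_file_{i + 1}of{n_files}.csv"  # e.g. my_file_1of3.csv
        with open(out_name, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            writer.writerows(chunk)

split_csv("my_table_export.csv")  # hypothetical source file
```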
I have walked through the documentation, blogs, and videos, and managed to build a POC in my personal account with Azure Functions, Azure Batch, and Databricks using a Python script. The problem is that the company won't let me use any of these approaches.
So I split the output file using the number of partitions, but the resulting files come out with different sizes depending on the table and the partition.
Is there a way to accomplish this? I'm now experimenting with Lookup and ForEach activities in the pipeline, but without good results so far.
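
The idea I'm trying to reproduce in the pipeline is roughly: a Lookup gets the row count, I compute how many chunks I need from an estimated average row size, and a ForEach runs one copy per chunk over an OFFSET/FETCH window. In Python terms, the arithmetic would be something like this (a sketch only; the table name, the ordering column, and the bytes-per-row figure are all assumptions):

```python
import math

MAX_BYTES = 100 * 1024 * 1024   # 100 MB per file
row_count = 200_000             # would come from the Lookup activity
avg_row_bytes = 1_024           # assumed/estimated average row size

# Number of files needed, then an even split of rows across them.
n_files = max(1, math.ceil(row_count * avg_row_bytes / MAX_BYTES))
rows_per_file = math.ceil(row_count / n_files)

# Each ForEach iteration would run a windowed query like this
# (SQL Server style OFFSET/FETCH; 'my_table' and 'id' are placeholders):
for i in range(n_files):
    query = (
        f"SELECT * FROM my_table ORDER BY id "
        f"OFFSET {i * rows_per_file} ROWS FETCH NEXT {rows_per_file} ROWS ONLY"
    )
    print(f"my_file_{i + 1}of{n_files}.csv <- {query}")
```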
Any idea or clue would be welcome. Thanks!