
I have a bunch of Salesforce objects being copied into Azure Data Lake Storage Gen2 as parquet files. I want to copy these into an Azure Synapse dedicated SQL pool as tables, but I don't want every column, I only need a fraction of them. I know I can do a Copy activity for each parquet file and use the mapping function to map the columns I want, but has anyone had success parameterizing this in some way? Each table is going to have different column names, but I'm thinking I could create a control document somewhere and look up the table, with maybe a boolean field for whether I want the column. I have a general idea that I could make this work, but I don't want to reinvent the wheel if someone is already doing it or if there's a better way I just don't know about. Thanks in advance for your help.

I know I can do a copy action for each parquet and use the mapping function to map the columns I want

  • You can try an approach like getting the details of the files in ADLS via the Get Metadata child items, and then using Get Metadata > For Each > (inside the For Each) Lookup > Copy activity, using dataset parameters. – DileeprajnarayanThumula Jun 08 '23 at 03:37

1 Answer


As far as I understand, you can build your Synapse pipeline based on the metadata of the files existing in ADLS Gen2.

In my experience, I have dealt with CSV files, dynamically copying each file into an Azure Synapse dedicated SQL pool rather than hardcoding a Copy activity for each individual file and table. This approach may help you. For example, I have 2 CSV files in ADLS: EMPLOYEE.CSV and DEPARTMENTS.CSV.

First, I do a manual import schema in the Copy activity mapping to get the JSON body for only the columns I need (the desired columns only). From that JSON you get the type key and the mappings key, which point to the source and sink column names; do the same for the employee dataset, and copy the JSON body into a notepad. Now, in Azure Synapse, I create a config table called table_mapping with the below schema.

CREATE TABLE table_mapping
(
    SourceFile        NVARCHAR(50),
    SourceTableSchema NVARCHAR(50),
    SinkTableName     NVARCHAR(50),
    jsonMapping       NVARCHAR(4000)
)

The jsonMapping column contains the JSON body shown above, for the desired columns only. Now let's build the dynamic pipeline.

Get Metadata activity: to find out which files we have in ADLS, it should point to the folder only, NOT to any file.

For Each: iterate over the file list returned by Get Metadata (the Items expression is sketched below, after the Lookup query).

Inside the For Each, a Lookup activity: the Lookup queries table_mapping, so each iteration picks up the file name, table name, and so on for the current file. In the query:

SELECT * FROM table_mapping WHERE SourceFile = '@{item().name}'
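
For reference, assuming the Get Metadata activity is named Get Metadata1 and its field list includes Child items, the For Each Items setting would look something like the line below; inside the loop, @{item().name} then resolves to the current file name used in the query above.

@activity('Get Metadata1').output.childItems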

Next, the Copy activity. Create dataset parameters for both the source and the sink. Here are the expressions for the source and sink dataset parameters.

Source dataset:

FileName: @activity('Lookup1').output.firstRow.SourceFile

Sink dataset:

Schema: @activity('Lookup1').output.firstRow.SourceTableSchema

TableName: @activity('Lookup1').output.firstRow.SinkTableName
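
Inside the datasets themselves, these parameters are consumed with @dataset() references. As a sketch, assuming the source dataset defines a FileName parameter and the sink dataset defines Schema and TableName parameters:

Source dataset file name: @dataset().FileName

Sink dataset schema: @dataset().Schema

Sink dataset table name: @dataset().TableName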

In the Mapping tab, add dynamic content with the below expression.

@json(activity('Lookup1').output.firstRow.jsonMapping)

This approach of having a configuration table for the column mappings can dynamically copy the data from the source (ADLS) to the sink (Synapse dedicated SQL pool), loading only the desired columns.