https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse. According to this link, with 1000 DWU and PolyBase I should get about 200 MBps throughput, but I am only getting 4.66 MBps. I have added the user to the xlargerc resource class to achieve the best possible throughput from Azure SQL Data Warehouse.
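
For reference, the user was put into the resource class along these lines (a minimal sketch; the login name loaduser and the password are placeholders, not the actual values used):

-- Run against the master database: create a dedicated loading login.
-- 'loaduser' is a placeholder name.
CREATE LOGIN loaduser WITH PASSWORD = '<strong password>';

-- Run against the data warehouse database: create the user and add it
-- to the xlargerc resource class for the largest memory grant per query.
CREATE USER loaduser FOR LOGIN loaduser;
EXEC sp_addrolemember 'xlargerc', 'loaduser';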

Below is the Pipeline JSON.

{
    "name": "UCBPipeline-Copy",
    "properties": {
        "description": "pipeline with copy activity",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "allowPolyBase": true,
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    },
                    "cloudDataMovementUnits": 4
                },
                "inputs": [
                    {
                        "name": "USBBlob_Concept
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureDW_Concept"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "AzureBlobtoSQLDW_Concept",
                "description": "Copy Activity"
            }
        ],
        "start": "2017-02-28T18:00:00Z",
        "end": "2017-03-01T19:00:00Z",
        "isPaused": false,
        "hubName": "sampledf1_hub",
        "pipelineMode": "Scheduled"
    }
}

Input dataset:

{
    "name": "AzureBlob_Concept",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureZRSStorageLinkedService",
        "typeProperties": {
            "fileName": "conceptTab.txt",
            "folderPath": "source/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": "\t"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}

Output dataset:

{
    "name": "AzureDW_Concept",
    "properties": {
        "published": false,
        "type": "AzureSqlDWTable",
        "linkedServiceName": "AzureSqlDWLinkedService",
        "typeProperties": {
            "tableName": "concept"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Is anything missing in the configuration?

vidyak
  • Just an observation, if your data is already in blob storage, you *could* just load it directly with Polybase / external table then CTAS, no Data Factory required. – wBob Mar 08 '17 at 10:33
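
For readers who want to try wBob's suggestion, below is a minimal T-SQL sketch of a direct PolyBase load with an external table plus CTAS. The credential, storage account, and column list are placeholder assumptions, not details from the question:

-- One-time setup: a master key is required before creating a scoped credential.
CREATE MASTER KEY;

-- Credential holding the storage account key (placeholder secret).
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- External data source pointing at the blob container.
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://source@<account>.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

-- File format matching the tab-delimited conceptTab.txt.
CREATE EXTERNAL FILE FORMAT TabDelimited
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '\t')
);

-- External table over the source file; the column list here is hypothetical.
CREATE EXTERNAL TABLE dbo.concept_ext (
    concept_id INT,
    concept_name NVARCHAR(255)
)
WITH (
    LOCATION = '/conceptTab.txt',
    DATA_SOURCE = AzureBlob,
    FILE_FORMAT = TabDelimited
);

-- CTAS runs the import in parallel across all compute nodes.
CREATE TABLE dbo.concept
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.concept_ext;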

1 Answer

I took a look at runId "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your Copy Activity. From our service logs, the source file is not large enough (270 MB in your case), so the service call latency makes the throughput look poor: with a file that small, the fixed per-call setup and latency overhead dominates the total copy time. You could try loading bigger files to get better throughput.

Yingqin
  • I tried with 9.16 GB of data. It took 47 minutes to process with cloudDataMovementUnits set to 4. Output is: Data Read: 9.16 GB, Written: 0.00 Bytes, Data volume: 9.16 GB, Rows: 316498942, Throughput: 3.66 MB/s, Duration: 02:50:29, Source to staging blob runtime region: West US, Staging blob to destination runtime region: West US, Polybase used on sink: True – vidyak Mar 12 '17 at 11:32
  • Could you provide the runId of the new run you mentioned? It seems that your new run was using a staged blob, which leads to the low throughput. – Yingqin Mar 13 '17 at 02:03
  • RunID for the above run is: 0bcf9cc7-e0c8-4e6b-bf7c-82174598271d – vidyak Mar 13 '17 at 06:09
  • You were using Staged Copy which has much lower throughput than Direct Copy with Polybase. Could you use direct copy instead? Refer to this doc: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-azure-sql-data-warehouse-connector#use-polybase-to-load-data-into-azure-sql-data-warehouse. – Yingqin Mar 14 '17 at 01:42
  • I had a header in my CSV file. I had to remove it and remove the property "skipHeaderLineCount": 1; because of the header it was going into staged copy. Now I am able to get around 106 MBps throughput with 1000 DWU. – vidyak Mar 16 '17 at 14:26