GCP Dataproc has Druid available in alpha. How to load segments?

Question

The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but I don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some Google-specific references to a bucket, but there are no examples of how to do this.

What is the method to load data into Druid running on GCP Dataproc, straight out of the box?

score 6 · Accepted Answer · answered Sep 27 '19 at 10:05

I haven't used the Dataproc version of Druid, but I have a small cluster running on a Google Compute Engine VM. I ingest data into it from GCS using the Google Cloud Storage Druid extension: https://druid.apache.org/docs/latest/development/extensions-core/google.html

To enable the extension, add it to the list of extensions in your Druid common.runtime.properties file:

druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]

To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task

The POST request body contains the JSON ingestion spec (see the ["ioConfig"]["firehose"] section):

{
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "daily_xport_test",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "MONTH",
                "queryGranularity": "NONE",
                "rollup": false
            },
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {
                        "column": "dateday",
                        "format": "auto"
                    },
                    "dimensionsSpec": {
                        "dimensions": [{
                                "type": "string",
                                "name": "id",
                                "createBitmapIndex": true
                            },
                            {
                                "type": "long",
                                "name": "clicks_count_total"
                            },
                            {
                                "type": "long",
                                "name": "ctr"
                            },
                            "deleted",
                            "device_type",
                            "target_url"
                        ]
                    }
                }
            }
        },
        "ioConfig": {
            "type": "index_parallel",
            "firehose": {
                "type": "static-google-blobstore",
                "blobs": [{
                    "bucket": "data-test",
                    "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
                }],
                "filter": "*.json.gz$"
            },
            "appendToExisting": false
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumSubTasks": 1,
            "maxRowsInMemory": 1000000,
            "pushTimeout": 0,
            "maxRetry": 3,
            "taskStatusCheckPeriodMs": 1000,
            "chatHandlerTimeout": "PT10S",
            "chatHandlerNumRetries": 5
        }
    }
}
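
The "blobs" array in the spec above lists a single object, but it can hold several bucket/path entries. As a rough sketch (not part of the original setup), the helper below turns gs:// URLs, such as the lines printed by `gsutil ls`, into entries in the same shape. The function name `gs_urls_to_blobs` is hypothetical, and it assumes object names contain no characters that need JSON escaping.

```shell
# Hypothetical helper: read gs:// URLs on stdin (one per line) and print
# a JSON array of {"bucket", "path"} entries for the firehose "blobs" field.
# Assumption: object names need no JSON escaping (no quotes or backslashes).
gs_urls_to_blobs() {
  awk '
    BEGIN { printf "[" }
    /^gs:\/\// {
      url = substr($0, 6)            # drop the leading "gs://"
      slash = index(url, "/")        # first "/" separates bucket from object
      if (slash == 0) next           # bucket only, no object: skip
      bucket = substr(url, 1, slash - 1)
      path = substr(url, slash)      # keep the leading "/" as in the spec above
      printf "%s{\"bucket\": \"%s\", \"path\": \"%s\"}", (n++ ? ", " : ""), bucket, path
    }
    END { print "]" }
  '
}

# Possible usage (requires gsutil and access to the bucket):
#   gsutil ls 'gs://data-test/sample_data/daily_export_18092019/*.json.gz' | gs_urls_to_blobs
```

The printed array can then be pasted into the "blobs" field of the spec.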

Example cURL command to start an ingestion task in Druid (spec.json contains the ingestion spec above):

curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
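
The Overlord answers the POST with a small JSON object containing the task ID, which can then be used to check on the task. The sketch below wraps the same request in a function and adds two hypothetical helpers (`extract_task_id`, `task_status_url`). It assumes the response has the form {"task":"<id>"} and that the Overlord's task status endpoint is reachable at the same host and port as above.

```shell
# Submit an ingestion spec; prints the Overlord's JSON response.
# $1 = Overlord base URL, $2 = path to the spec file
submit_task() {
  curl -s -X POST -H 'Content-Type: application/json' \
       -d @"$2" "$1/druid/indexer/v1/task"
}

# Pull the task ID out of a response such as {"task":"some_task_id"}.
extract_task_id() {
  printf '%s\n' "$1" | sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Build the status URL for a task.
# $1 = Overlord base URL, $2 = task ID
task_status_url() {
  printf '%s/druid/indexer/v1/task/%s/status\n' "$1" "$2"
}

# Possible usage against a running cluster:
#   OVERLORD=http://druid-overlord-host:8081
#   RESPONSE="$(submit_task "$OVERLORD" spec.json)"
#   TASK_ID="$(extract_task_id "$RESPONSE")"
#   curl -s "$(task_status_url "$OVERLORD" "$TASK_ID")"
```

The status response is printed as raw JSON; rerun the last curl until the task is no longer reported as running.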

How good is the throughput of `index_parallel`? Should I opt for `index_parallel` or `hadoop` if I have 2 TB of data stored in GCS? - kaysush, Apr 08 '20 at 16:17
