Is it possible to have a Druid datasource with two (or multiple) timestamps in it? I know that Druid is a time-based DB and I have no problem with the concept, but I'd like to add another dimension that I can work with as a timestamp.

E.g. user retention: the metric is of course tied to a certain date, but I also need to create cohorts based on users' registration dates, roll those dates up to weeks or months, or filter to only certain time periods...

If the functionality is not supported, are there any plug-ins? Any dirty solutions?

Štefan Šoóš

4 Answers

Although I'd rather wait for full, official support of timestamp dimensions to land in Druid, I've found the 'dirty' hack I've been looking for.

DataSource Schema

First things first: I wanted to know how many users logged in on each day, while being able to aggregate by date/month/year cohorts.

Here's the data schema I used:

"dataSchema": {
  "dataSource": "ds1",
  "parser": {
    "parseSpec": {
      "format": "json",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
            "user_id",
            "platform",
            "register_time"
        ],
        "dimensionExclusions": [],
        "spatialDimensions": []
      }
    }
  },
  "metricsSpec": [
    { "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "DAY",
          "intervals": ["2015-01-01/2017-01-01"]
  }
},

So the sample data should look something like this (each record is a login event):

{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}

As you can see, the "main" timestamp against which I calculate these metrics is the timestamp field, while register_time is only a dimension in string form, in ISO 8601 UTC format.

Aggregating

And now, for the fun part: I've been able to aggregate by timestamp (as a date) and by register_time (again as a date) thanks to the Time Format Extraction Function.

The query looks like this:

{
    "intervals": "2016-01-20/2016-07-01",
    "dimensions": [
        {
            "type": "extraction",
            "dimension": "register_time",
            "outputName": "reg_date",
            "extractionFn": {
                "type": "timeFormat",
                "format": "YYYY-MM-dd",
                "timeZone": "Europe/Bratislava" ,
                "locale": "sk-SK"
            }
        }
    ],
    "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
    "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
    "dataSource": "ds1",
    "queryType": "groupBy"
}
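
The same extraction function also handles the coarser cohorts mentioned in the question: changing only the format rolls register_time up to months. A minimal sketch of the dimension spec (the reg_month output name is mine, not from the original query):

{
    "type": "extraction",
    "dimension": "register_time",
    "outputName": "reg_month",
    "extractionFn": {
        "type": "timeFormat",
        "format": "yyyy-MM",
        "timeZone": "Europe/Bratislava",
        "locale": "sk-SK"
    }
}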

Filtering

The solution for filtering is based on the JavaScript extraction function, with which I can transform the date to UNIX time and use it inside (for example) a bound filter. The lower/upper epoch-second bounds below (May 1 to June 1, 2016) are just an illustrative registration window:

{
    "intervals": "2016-01-20/2016-07-01",
    "dimensions": [
        "platform",
        {
            "type": "extraction",
            "dimension": "register_time",
            "outputName": "reg_date",
            "extractionFn": {
                "type": "javascript",
                "function": "function(x) {return Date.parse(x)/1000}"
            }
        }
    ],
    "granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
    "aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
    "dataSource": "ds1",
    "queryType": "groupBy"
    "filter": {
        "type": "bound",
        "dimension": "register_time",
        "outputName": "reg_date",
        "alphaNumeric": "true"
        "extractionFn": {
            "type": "javascript",
            "function": "function(x) {return Date.parse(x)/1000}"
        }
    }
}

I tried to filter 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, although I double-checked the function in various JavaScript REPLs. But hey, I'm no JavaScript expert.
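
For reference, such a direct JavaScript filter looks roughly like the sketch below (this sketch is untested, given the issues above; the millisecond epoch bounds mirror the illustrative window used in the bound filter):

{
    "type": "javascript",
    "dimension": "register_time",
    "function": "function(x) { var t = Date.parse(x); return t >= 1462060800000 && t < 1464739200000 }"
}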

Štefan Šoóš
  • Do you know how to perform a SELECT on the results returned from a GROUP BY in Druid? http://stackoverflow.com/questions/38546593/how-to-perform-a-select-in-the-results-returned-from-a-group-by-druid – Jul 23 '16 at 21:35
  • @Štefan, how do you pass the register time to Druid? I'm very interested in your architecture! Do you get the register time from one of your services every time a user visits, append it to the event, and send it back? Or do you have multiple topics where you manipulate the event, add the register time, and send it to another Kafka topic that Druid listens to? – gwthm.in Apr 14 '17 at 07:28

Unfortunately, Druid has only one timestamp column that can be used for rollup. Currently, Druid treats all other columns as strings (except metrics, of course), so you can add another string column with timestamp values, but the only thing you can do with it is filter on it. I guess you might be able to hack it that way. Hopefully, in the future, Druid will allow different column types, and maybe timestamp will be one of them.
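
One thing that makes string filtering workable here: ISO 8601 timestamps sort lexicographically in chronological order, so a plain bound filter needs no extraction function at all. A minimal sketch, assuming a string dimension named register_time as in the accepted answer:

{
    "type": "bound",
    "dimension": "register_time",
    "lower": "2016-05-01T00:00:00.000Z",
    "upper": "2016-06-01T00:00:00.000Z"
}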

Slim Bouguerra

Another solution is to add a longMin (or similar) metric for the timestamp and store the epoch time in that field, or to convert the datetime to a number and store that (e.g. 31st March 2021 08:00 becomes 310320210800).
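
The idea, presumably, is that a longMin aggregation survives rollup: if every rolled-up row for a given user carries the same registration time, the minimum simply preserves it. A minimal sketch of the metricsSpec entry, assuming the input rows have a register_time_epoch field (a name invented for this example):

"metricsSpec": [
    { "type": "longMin", "name": "register_time_epoch", "fieldName": "register_time_epoch" }
]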

100MIL

As of Druid 0.22, the documentation states that secondary timestamps should be handled/parsed as dimensions of type long. Secondary timestamps can be parsed to longs at ingestion time with a transformSpec, and transformed back if needed at query time (link).
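
A minimal sketch of such an ingestion-time transform, using Druid's built-in timestamp_parse expression (the register_time and register_time_long names are assumptions carried over from the earlier answers, not from the docs):

"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "register_time_long", "expression": "timestamp_parse(register_time)" }
  ]
},
"dimensionsSpec": {
  "dimensions": [
    { "type": "long", "name": "register_time_long" }
  ]
}

At query time, timestamp_format(register_time_long) converts the long back into an ISO 8601 string.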

Zisis F