
I am not able to understand the concept of groupBy/groupByKey and windowing in Kafka Streams. My goal is to aggregate stream data over some time period (e.g. 5 seconds). My streaming data looks something like:

{"value":0,"time":1533875665509}
{"value":10,"time":1533875667511}
{"value":8,"time":1533875669512}

The time is in milliseconds (epoch). Here the timestamp is in the message itself, not in the key, and I want to average the values over a 5-second window.

Here is the code that I am trying, but it seems I am unable to get it to work:

builder.<String, String>stream("my_topic")
   .map((key, val) -> { TimeVal tv = TimeVal.fromJson(val); return new KeyValue<Long, Double>(tv.time, tv.value);})
   .groupByKey(Serialized.with(Serdes.Long(), Serdes.Double()))
   .windowedBy(TimeWindows.of(5000))
   .count()
   .toStream()
   .foreach((key, val) -> System.out.println(key + " " + val));

This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C then it prints something like

[1533877059029@1533877055000/1533877060000] 1
[1533877061031@1533877060000/1533877065000] 1
[1533877063034@1533877060000/1533877065000] 1
[1533877065035@1533877065000/1533877070000] 1
[1533877067039@1533877065000/1533877070000] 1

This output does not make sense to me.

Related code:

public class MessageTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record,  long previousTimestamp) {
        String str = (String)record.value();
        TimeVal tv = TimeVal.fromJson(str);
        return tv.time;
    }
}

public class TimeVal
{
    final public long time;
    final public double value;
    public TimeVal(long tm, double val) {
        this.time = tm;
        this.value = val;
    }
   public static TimeVal fromJson(String val) {
       Gson gson = new GsonBuilder().create();
       TimeVal tv = gson.fromJson(val, TimeVal.class);
       return tv;
   }
}

Questions:

Why do you need to pass a serializer/deserializer to groupBy? Some of the overloads also take a ValueStore; what is that? When grouped, how does the data look in the grouped stream?

How is the windowed stream related to the grouped stream?

I was expecting the above to print in a streaming way, that is, buffer for every 5 seconds, then count, then print. It only prints after I press Ctrl+C on the command prompt, i.e. it prints and then exits.

2 Answers


It seems you don't have keys in your input data (correct me if this is wrong), and it further seems that you want to do a global aggregation?

In general, grouping is for splitting a stream into sub-streams. Those sub-streams are built by key (i.e., one logical sub-stream per key). In your code snippet, you set the timestamp as the key and thus generate a sub-stream per timestamp. I assume this is not intended.

If you want to do a global aggregation, you will need to map all records to a single sub-stream, i.e., assign the same key to all records in groupBy(). Note that global aggregations don't scale, as the aggregation must be computed by a single thread. Thus, this will only work for small workloads.
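As an illustration of the semantics this would give (a plain-Java sketch with no Kafka dependencies; the constant "all" key is a made-up placeholder, and the sample records are the ones from the question), mapping everything to one dummy key means all records land in a single group, and a 5-second tumbling window then buckets that group by time:

```java
import java.util.Map;
import java.util.TreeMap;

public class GlobalWindowedAverage {
    // Mimics: stream.groupBy((k, v) -> "all") followed by a 5-second
    // tumbling window and an average aggregation.
    static final long WINDOW_SIZE_MS = 5000L;

    // A tumbling window is identified by its start: the timestamp
    // rounded down to a multiple of the window size.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    public static void main(String[] args) {
        long[] times = {1533875665509L, 1533875667511L, 1533875669512L};
        double[] values = {0, 10, 8};

        // windowStart -> {sum, count}; every record shares the dummy key
        // "all", so windows partition the single group by time only
        Map<Long, double[]> windows = new TreeMap<>();
        for (int i = 0; i < times.length; i++) {
            double[] agg = windows.computeIfAbsent(windowStart(times[i]), w -> new double[2]);
            agg[0] += values[i];
            agg[1] += 1;
        }

        windows.forEach((start, agg) -> System.out.println(
            "[" + start + "/" + (start + WINDOW_SIZE_MS) + "] avg=" + (agg[0] / agg[1])));
        // all three sample records fall into [1533875665000/1533875670000],
        // so this prints a single window with avg=6.0
    }
}
```

In the actual DSL, the same idea would be a groupBy() that returns the constant key plus windowedBy() and an aggregate() that keeps a running sum and count per window; the helper class and the "all" key above are illustrative, not part of any Kafka API.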

Windowing is applied to each generated sub-stream to build the windows, and the aggregation is computed per window. The windows are built based on the timestamp returned by the TimestampExtractor. It seems you already have an implementation that extracts the timestamp from the value for this purpose.
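One thing to double-check: a custom extractor like MessageTimeExtractor only takes effect if it is registered in the application configuration via the `default.timestamp.extractor` setting; otherwise Kafka Streams uses the record's built-in timestamp. A minimal sketch (the application id, bootstrap server, and the package prefix of the extractor class are placeholders):

```java
import java.util.Properties;

public class StreamsConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // placeholder values; adjust for your environment
        props.put("application.id", "windowed-average-app");
        props.put("bootstrap.servers", "localhost:9092");
        // use the timestamp embedded in the message value instead of the
        // record's broker/producer timestamp; the package is a placeholder,
        // substitute the fully-qualified name of your extractor class
        props.put("default.timestamp.extractor", "com.example.MessageTimeExtractor");
        return props;
    }
}
```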

This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C then it prints something like

By default, Kafka Streams uses some internal caching, and the cache is flushed on commit -- this happens every 30 seconds by default, or when you stop your application. You would need to disable caching to see results earlier (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html)
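To see an update for every input record instead of only on commit, the cache size can be set to zero; a small sketch of the relevant settings (names per the memory-management docs linked above):

```java
import java.util.Properties;

public class CachingConfigSketch {
    public static Properties disableCaching(Properties props) {
        // a cache size of 0 disables record caches, so every
        // aggregation update is forwarded downstream immediately
        props.put("cache.max.bytes.buffering", "0");
        // optionally commit more often than the 30-second default
        props.put("commit.interval.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = disableCaching(new Properties());
        System.out.println(props.getProperty("cache.max.bytes.buffering"));
    }
}
```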

Why do you need to pass a serializer/deserializer to groupBy?

Because the data needs to be redistributed, and this happens via a topic in Kafka. Note that Kafka Streams is built for a distributed setup, with multiple instances of the same application running in parallel to scale out horizontally.

Btw: you might also be interested in this blog post about the execution model of Kafka Streams: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/

Matthias J. Sax
  • I do have a key in my input data. It is the "time" in the example messages. Regarding the buffering, if I do not do groupBy, windowedBy and count, then it keeps printing, so I do not believe it is a caching issue. Also, as I understand from this answer, there is no good way to achieve what I want. But I believe that I have a very simple use case. – x64 Aug 10 '18 at 06:06
  • "time" seems not to be a good fit for a key... Note that the key is used for data partitioning and parallel processing. Also, caching applies to KTables only -- if you don't aggregate, there is no caching ;) "there is no good way to achieve what I want" -- I would not say that, but I am not 100% sure what you want and what your requirements are. – Matthias J. Sax Aug 10 '18 at 14:54
  • Here is my use case. I have sensor data coming from a device into a topic. I have one topic per sensor. The sensor data contains "timestamp" and "value". Now I want to average the sensor data every N seconds and put it into another topic, which will be sunk to a database. – x64 Aug 10 '18 at 19:58
  • It might be a better design to have one topic for all sensors and put a sensor ID as the key? Then you can read this topic, use the sensor ID as the key, apply an N-second window, compute the average, and write the result to a topic. – Matthias J. Sax Aug 10 '18 at 21:25
  • There is a problem in having the same topic for all sensors. In the application, there is a client (consumer) which may be interested in only one sensor at a time. It would unnecessarily get flooded with all sensors' data (at least, it would get flooded with all messages in a partition), which is not desirable. – x64 Aug 11 '18 at 21:18
  • I ended up reading the Kafka Streams code (https://github.com/apache/kafka/tree/b9f11796944056a5b3c7440033d587d19290b3c3/streams/src/main/java/org/apache/kafka/streams) a bit and now have a fair understanding of what the above code does. I think the problem is that there is no concept of a watermark in Kafka Streams. Otherwise I could easily have achieved what I wanted. Probably I will have to write my own processor class. – x64 Aug 11 '18 at 21:25
  • You can still go with a dummy key and do the aggregation. I am not sure how watermark would be related... – Matthias J. Sax Aug 12 '18 at 02:05

It seems like you misunderstand the nature of the windowing DSL.

It works on the record timestamps handled by the Kafka platform (or whatever the configured TimestampExtractor returns), not directly on arbitrary properties inside your specific message type that encode time information. Also note that TimeWindows.of(5000) defines tumbling windows: fixed, non-overlapping 5-second intervals, with each record assigned to the interval its timestamp falls into.

Also, all elements need the same key to be combined into the same group, for example a constant or null. In your example the key is a timestamp, which is essentially unique per entry, so there will be only a single element in each group.

Ivan Klass