
I am using AWS Kinesis Firehose to ingest data into S3 and then query it with Athena.

I am trying to analyze events from different games. To avoid having Athena scan too much data, I would like to partition the S3 data using an identifier for each game. So far I have not found a solution, as Firehose receives data from several different games.

Does anyone know how to do this?

Thank you, Javi.

  • It would be better to add the code you have so far – ammportal Aug 01 '17 at 08:07
  • Why is this question marked as a duplicate? It's a valid and very different question; it's an error to mark it as a duplicate. This question asks how to create a custom partition based on a value in the Kinesis stream, whereas the supposed duplicate talks about Parquet files; the two are completely different. Kinesis can work without transformation too. Please remove the duplicate mark. – suresh Sep 11 '19 at 13:49

2 Answers


You could possibly use Amazon Kinesis Analytics to split the incoming Firehose stream into separate output streams based on some logic, such as the Game ID.

It can accept a KinesisFirehoseInput and send data to a KinesisFirehoseOutput.

However, the limits documentation suggests that there can only be 3 output destinations per application, so this would not be sufficient if you have more than three games.
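
For illustration, here is a minimal sketch of defining such a splitting application with boto3 (the v1 kinesisanalytics API). All ARNs, stream names, and the game_id/payload schema are placeholder assumptions, and only one of the (at most three) per-game outputs is shown:

    import boto3

    kda = boto3.client("kinesisanalytics")

    # In-application SQL: one in-app stream + pump per game. Each
    # in-app stream is wired to its own Firehose output below.
    APPLICATION_CODE = """
    CREATE OR REPLACE STREAM "GAME_A_STREAM" ("game_id" VARCHAR(32), "payload" VARCHAR(1024));
    CREATE OR REPLACE PUMP "GAME_A_PUMP" AS
        INSERT INTO "GAME_A_STREAM"
        SELECT STREAM "game_id", "payload"
        FROM "SOURCE_SQL_STREAM_001"
        WHERE "game_id" = 'game_a';
    """

    kda.create_application(
        ApplicationName="split-by-game",        # placeholder name
        ApplicationCode=APPLICATION_CODE,
        Inputs=[{
            "NamePrefix": "SOURCE_SQL_STREAM",  # yields SOURCE_SQL_STREAM_001
            "KinesisFirehoseInput": {
                "ResourceARN": "arn:aws:firehose:us-east-1:123456789012:deliverystream/main-events",
                "RoleARN": "arn:aws:iam::123456789012:role/kda-read-role",
            },
            "InputSchema": {
                "RecordFormat": {
                    "RecordFormatType": "JSON",
                    "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
                },
                "RecordColumns": [
                    {"Name": "game_id", "SqlType": "VARCHAR(32)", "Mapping": "$.game_id"},
                    {"Name": "payload", "SqlType": "VARCHAR(1024)", "Mapping": "$.payload"},
                ],
            },
        }],
        Outputs=[{
            # The output name must match the in-app stream created in the SQL.
            "Name": "GAME_A_STREAM",
            "KinesisFirehoseOutput": {
                "ResourceARN": "arn:aws:firehose:us-east-1:123456789012:deliverystream/game-a-events",
                "RoleARN": "arn:aws:iam::123456789012:role/kda-write-role",
            },
            "DestinationSchema": {"RecordFormatType": "JSON"},
        }],
    )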

John Rotenstein

You could send your traffic to the main Firehose stream, then use a Lambda function to split the data into multiple Firehose streams, one per game, each of which saves its data in a separate folder/bucket.
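
A minimal sketch of such a splitting function, assuming the events first land on a Kinesis data stream that triggers the Lambda, that each event is a JSON object with a game_id field, and that the per-game delivery streams are named game-events-<game_id> (all of these names are assumptions):

    import base64
    import json
    from collections import defaultdict

    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        """Fan events out from the main stream to one Firehose
        delivery stream per game, keyed on the game_id field."""
        batches = defaultdict(list)

        # Records arrive base64-encoded from the Kinesis trigger;
        # decode each one and group it by game. A trailing newline
        # keeps the objects line-delimited for Athena.
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            if not payload.endswith(b"\n"):
                payload += b"\n"
            game_id = json.loads(payload)["game_id"]
            batches[game_id].append({"Data": payload})

        # One PutRecordBatch call per game, respecting the API's
        # limit of 500 records per call.
        for game_id, records in batches.items():
            for i in range(0, len(records), 500):
                firehose.put_record_batch(
                    DeliveryStreamName="game-events-" + game_id,
                    Records=records[i:i + 500],
                )

Batching the writes this way also addresses the cost concern raised in the comments below: one invocation can forward hundreds of events rather than one.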

Shimon Tolts
  • I thought about this, but there is a problem: I expect around 20 million daily events, which means the Lambda function would be triggered 20M times a day just to "classify" the events, and that would be expensive. – bracana Aug 01 '17 at 08:57
  • I have found CloudWatch Events, which could help me do this at time intervals, but it could still be expensive – bracana Aug 01 '17 at 09:04
  • 1
    Lambda can be triggered as a batch up to 1000 events via FireHose - which will reduce your triggers dramatically – Shimon Tolts Aug 01 '17 at 11:03
  • I have already done as you suggested by assigning a Lambda function to a Firehose stream. Since I have configured a batch interval of 300 seconds, it is enough to fill my needs. Batching on event count is not valid for me, as I can receive many events in a short period of time or only a few, and I do not want to wait until I have received 1,000 events. Thank you very much for your help!! – bracana Aug 01 '17 at 11:17
  • Please note that the batch setting is "up to X events"; it will not hold your stream until it reaches the limit – Shimon Tolts Aug 01 '17 at 13:06