
Question

I've read this, this, and this article, but they give contradictory answers to the question: how do I customize partitioning when ingesting data into S3 from a Kinesis Stream?

More details

Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, data will be processed with EMR Spark.

From time to time I have to handle a historical bulk ingest into Kinesis Streams. The issue is that my Spark logic depends heavily on data partitioning and on the order in which events are handled, but Firehose supports partitioning only by ingestion_time (the time a record enters the Kinesis Stream), not by any other custom field (I need event_time).

For example, Firehose's partition 2018/12/05/12/some-file.gz can contain data from the last few years.

Workarounds

Could you please help me choose between the following options?

  1. Copy/partition the data from the Kinesis Stream with a custom Lambda. This looks more complex and error-prone to me, perhaps because I'm not very familiar with AWS Lambda, and I'm not sure how well it will perform under bulk load. This article also says that the Lambda option is much more expensive than Firehose delivery.
  2. Load the data with Firehose, then launch a Spark job on EMR to copy the data to another bucket with the right partitioning (see the sketch after this list). At least this sounds simpler to me (biased, since I'm just starting with AWS Lambda), but it has the drawback of copying the data twice and needing an additional Spark job.
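
Roughly, what I have in mind for option 2 is something like the PySpark job below. The bucket names, the raw JSON layout, and the event_time field are placeholders, not exact values from my setup:

```python
# Option 2 sketch: re-read the Firehose output and rewrite it partitioned by event_time.
# Bucket names, the JSON layout, and the "event_time" field are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition-by-event-time").getOrCreate()

# Firehose writes gzipped objects under yyyy/MM/dd/HH prefixes based on ingestion time.
raw = spark.read.json("s3://my-firehose-bucket/data/*/*/*/*/*.gz")

# Derive partition columns from the event_time carried inside each record.
partitioned = (
    raw.withColumn("event_ts", F.to_timestamp("event_time"))
       .withColumn("year", F.date_format("event_ts", "yyyy"))
       .withColumn("month", F.date_format("event_ts", "MM"))
       .withColumn("day", F.date_format("event_ts", "dd"))
       .drop("event_ts")
)

# Write to a second bucket, partitioned by event time instead of ingestion time.
(partitioned.write
    .partitionBy("year", "month", "day")
    .mode("append")
    .parquet("s3://my-partitioned-bucket/data/"))
```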

In a single hour I could have up to 1M rows, which take up to 40 MB (compressed). From Using AWS Lambda with Amazon Kinesis I know that the Kinesis-to-Lambda event source mapping is limited to 10,000 records per batch. Would it be effective to process such a volume of data with Lambda?
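
For option 1, I imagine the Lambda would have to do something like the sketch below, grouping each batch by event_time and writing it to event-time-based prefixes. The bucket name and the event_time field are placeholders, and the batching is simplified:

```python
# Option 1 sketch: a Lambda triggered by the Kinesis stream that groups records by
# event_time and writes each group to an event-time-based S3 prefix.
# The bucket name and the "event_time" field are placeholders.
import base64
import gzip
import json
import uuid
from collections import defaultdict

import boto3

s3 = boto3.client("s3")
BUCKET = "my-partitioned-bucket"  # placeholder target bucket

def handler(event, context):
    groups = defaultdict(list)
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        ts = payload["event_time"]  # e.g. "2018-12-05T12:34:56Z"
        prefix = f"{ts[0:4]}/{ts[5:7]}/{ts[8:10]}/{ts[11:13]}"
        groups[prefix].append(payload)

    # One gzipped object per event-time prefix per invocation.
    for prefix, rows in groups.items():
        body = gzip.compress("\n".join(json.dumps(r) for r in rows).encode("utf-8"))
        s3.put_object(Bucket=BUCKET, Key=f"data/{prefix}/{uuid.uuid4()}.gz", Body=body)
```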

VB_

2 Answers


While Kinesis does not allow you to define custom partitions, Athena does!

The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table data_by_event_time that has the same schema, but is partitioned by event_time.

Now you can use Athena's INSERT INTO capability to repartition the data without writing a Hadoop or Spark job, and you get Athena's serverless scale-out for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
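
A rough sketch of what the scheduled repartitioning could look like with boto3 (the database name, columns, output location, and the ingestion_date partition value are placeholders; the table names are the ones above):

```python
# Sketch: run an Athena INSERT INTO on a schedule (cron, Airflow, etc.) to move data
# from the ingestion-time table into the event-time-partitioned table.
# Database, columns, output location, and partition values are placeholders.
import boto3

athena = boto3.client("athena")

REPARTITION_QUERY = """
INSERT INTO data_by_event_time
SELECT some_field, another_field, event_time,
       substr(event_time, 1, 10) AS event_date   -- partition column derived from event_time
FROM data_by_ingestion_time
WHERE ingestion_date = '2018/12/05'              -- ingestion-time partition to repartition
"""

def repartition_partition():
    # Kick off the query; Athena does the scale-out work server-side.
    return athena.start_query_execution(
        QueryString=REPARTITION_QUERY,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
    )
```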

We dealt with this at my company, and the post below goes into more depth on the trade-offs of using EMR or a streaming solution, but with this approach you don't need to introduce any more systems like Lambda or EMR.

https://radar.io/blog/custom-partitions-with-kinesis-and-athena

J Kao

You can use the Kinesis stream directly and create the partitions however you want: you write a producer, and in your consumer you create the partitions. https://aws.amazon.com/pt/kinesis/data-streams/getting-started/
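
A very rough sketch of such a consumer (the stream name, bucket, and event_time field are placeholders; a real consumer would typically use the KCL or a Lambda trigger rather than a single-shard polling loop):

```python
# Sketch: a plain Kinesis consumer that writes each record to an S3 prefix derived
# from its event_time. Stream name, bucket, and field names are placeholders.
import json
import time

import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

shard_iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
    for record in resp["Records"]:
        payload = json.loads(record["Data"])
        ts = payload["event_time"]  # e.g. "2018-12-05T12:34:56Z"
        key = f"data/{ts[0:4]}/{ts[5:7]}/{ts[8:10]}/{ts[11:13]}/{record['SequenceNumber']}.json"
        s3.put_object(Bucket="my-partitioned-bucket", Key=key, Body=json.dumps(payload))
    shard_iterator = resp.get("NextShardIterator")
    time.sleep(1)  # simple throttle for the polling loop
```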

Luan Oliveira
  • I know that it is possible to do. The answer should contain considerations about why one option is better or worse than another. I don't see how the referenced resource ("Kinesis Getting Started") answers my question – VB_ Dec 05 '18 at 18:17