
I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data and DynamoDB is used to index it and provide metadata. This architecture is talked about and recommended in many places, for example in this AWS blog post: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/

However, I am struggling to understand how to use DynamoDB for the purposes of querying the event data in S3. In the link to the AWS blog above, they use the example of storing customer events produced by multiple different servers:

S3 path format: [4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data

Eg: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data

And the schema to record this event in DynamoDB looks like:

Customer ID (Partition Key), Timestamp-Server (Sort Key), S3-Key, Size
87423, 1436055953839-i-31cc02, a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data, 1234
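
For reference, here is a rough sketch (my own, not from the blog post) of how I imagine one of these index items being written with the AWS SDK for Java; the table name `EventIndex` is just a placeholder and the attribute names follow the example row above:

```java
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class EventIndexWriter {

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Writes one index item for an event object that has already been uploaded to S3.
    public void indexEvent(String customerId, String serverId, long epochMillis,
                           String s3Key, long sizeBytes) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("CustomerId", new AttributeValue().withS(customerId));                          // partition key
        item.put("Timestamp-Server", new AttributeValue().withS(epochMillis + "-" + serverId));  // sort key
        item.put("S3-Key", new AttributeValue().withS(s3Key));
        item.put("Size", new AttributeValue().withN(Long.toString(sizeBytes)));

        dynamo.putItem(new PutItemRequest().withTableName("EventIndex").withItem(item));
    }
}
```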

I would like to perform a query such as "Get me all the customer events produced by all servers in the last 24 hours", but as far as I understand, DynamoDB cannot be queried efficiently without supplying the partition key, and for this kind of query there is no single partition key value to supply.

Given this requirement, should I use a database other than DynamoDB to record where my events are in S3? Or do I simply need to use a different type of DynamoDB schema?

Alex Spurling
  • The architecture looks fine. However, you can't query DynamoDB without the partition key; without it you have to scan the whole table. The alternative would be to create a Global Secondary Index on the datetime field. – notionquest Nov 10 '16 at 15:19
  • @notionquest thanks. Could you expand on what you mean by Global Secondary Index and how it would help here? – Alex Spurling Nov 10 '16 at 15:21
  • 1
    how about using elasticsearch to index the metadat? Take a look at the link below: [indexing-metadata-in-amazon-elasticsearch-service-using-aws-lambda-and-python](https://aws.amazon.com/blogs/database/indexing-metadata-in-amazon-elasticsearch-service-using-aws-lambda-and-python/) – Payman Jan 23 '17 at 21:26

2 Answers


The architecture looks fine and is feasible with DynamoDB. The DynamoDBMapper class (in the AWS SDK for Java) can be used to create the model, and it has useful methods for getting the data from S3.

DynamoDBMapper

getS3ClientCache() Returns the underlying S3ClientCache for accessing S3.
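
For example, a DynamoDBMapper model for the index table in the question might look roughly like this (a sketch using the SDK v1 annotations; the table name `EventIndex` is a placeholder):

```java
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBRangeKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

// Sketch of a DynamoDBMapper model for the index table described in the question.
@DynamoDBTable(tableName = "EventIndex")
public class EventIndexItem {

    private String customerId;      // partition key
    private String timestampServer; // sort key: "<epoch timestamp>-<server id>"
    private String s3Key;
    private long size;

    @DynamoDBHashKey(attributeName = "CustomerId")
    public String getCustomerId() { return customerId; }
    public void setCustomerId(String customerId) { this.customerId = customerId; }

    @DynamoDBRangeKey(attributeName = "Timestamp-Server")
    public String getTimestampServer() { return timestampServer; }
    public void setTimestampServer(String timestampServer) { this.timestampServer = timestampServer; }

    @DynamoDBAttribute(attributeName = "S3-Key")
    public String getS3Key() { return s3Key; }
    public void setS3Key(String s3Key) { this.s3Key = s3Key; }

    @DynamoDBAttribute(attributeName = "Size")
    public long getSize() { return size; }
    public void setSize(long size) { this.size = size; }
}
```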

DynamoDB can't be queried without the partition key; if the partition key is not available you have to scan the whole table. However, you can create a Global Secondary Index (GSI) on the date/time field and query that index for your use case.

In simple terms, a GSI is similar to an index in an RDBMS. The difference is that you query the GSI directly rather than the main table. Normally a GSI is needed when you want to query DynamoDB for a use case where the partition key is not available. There are options to project ALL of the main table's attributes into the GSI, or only a selected subset.
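
As a rough sketch of what that could look like with the AWS SDK for Java (the `EventDate` attribute and index name are placeholders, and it assumes you also write a plain date/time attribute on each item):

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateGlobalSecondaryIndexAction;
import com.amazonaws.services.dynamodbv2.model.GlobalSecondaryIndexUpdate;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.Projection;
import com.amazonaws.services.dynamodbv2.model.ProjectionType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class AddDateIndex {
    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Adds a GSI keyed on a hypothetical "EventDate" attribute, projecting all attributes.
        CreateGlobalSecondaryIndexAction createIndex = new CreateGlobalSecondaryIndexAction()
                .withIndexName("EventDate-index")
                .withKeySchema(new KeySchemaElement("EventDate", KeyType.HASH))
                .withProjection(new Projection().withProjectionType(ProjectionType.ALL))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        dynamo.updateTable(new UpdateTableRequest()
                .withTableName("EventIndex")
                .withAttributeDefinitions(new AttributeDefinition("EventDate", ScalarAttributeType.S))
                .withGlobalSecondaryIndexUpdates(new GlobalSecondaryIndexUpdate().withCreate(createIndex)));
    }
}
```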

Global Secondary Index (GSI)

Difference between Scan and Query in DynamoDB

Yes, in this use case it looks like a GSI can't help, because the use case requires a range query on the partition key. DynamoDB supports only the equality operator on the partition key; range conditions are supported on the sort key (and, via filters, on non-key attributes) only when the partition key is supplied. You may have to scan the table to fulfil this use case, which is a costly operation.
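
To make the limitation concrete, here is a sketch of the only efficient query shape the base table supports, i.e. one customer at a time (attribute names from the question, AWS SDK for Java; the timestamp bounds are just example values):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;

public class PerCustomerQuery {
    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Allowed: equality on the partition key plus a range condition on the sort key.
        // The sort key starts with a fixed-width epoch-millis timestamp, so a
        // lexicographic BETWEEN behaves like a time range for a single customer.
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":c", new AttributeValue().withS("87423"));
        values.put(":from", new AttributeValue().withS("1435969553839"));  // example lower bound
        values.put(":to", new AttributeValue().withS("1436055953840"));    // example upper bound

        QueryRequest perCustomer = new QueryRequest()
                .withTableName("EventIndex")
                .withKeyConditionExpression("CustomerId = :c AND #ts BETWEEN :from AND :to")
                .withExpressionAttributeNames(Collections.singletonMap("#ts", "Timestamp-Server"))
                .withExpressionAttributeValues(values);

        // There is no equivalent for "all customers in the last 24 hours":
        // that would need a range condition on the partition key, which is not supported.
        dynamo.query(perCustomer).getItems()
              .forEach(item -> System.out.println(item.get("S3-Key").getS()));
    }
}
```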

You either have to think about an alternate data model that lets you query by partition key, or use some other database.

notionquest
  • The GSI docs say: "Every global secondary index must have a partition key, and can have an optional sort key." I think this means it won't solve the problem of efficiently retrieving records using a time range? – Alex Spurling Nov 10 '16 at 15:38
  • In the above use case, Timestamp-Server should be the partition key of the GSI. – notionquest Nov 10 '16 at 15:41
  • If I use the timestamp as a partition key, then I also need to specify a value for it when I query on the index. Again, from the docs: "You need to specify the index name, the query criteria for the index partition key and sort key (if present)". Again does this mean I cannot do efficient range queries? – Alex Spurling Nov 10 '16 at 15:48
  • Agreed, GSI can't help here. Updated my answer. – notionquest Nov 10 '16 at 16:03

First, I've read that same AWS blog page too: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/

The only way you can make this work with DynamoDB is:

  • add another attribute called "foo" and put same value 1 for all items
  • add another attribute called "timestamp" and put epoch timestamp there
  • create a GSI with partition key "foo" and range key "timestamp", and project all other attributes
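
Roughly, a sketch of the 24-hour query against that index with the AWS SDK for Java (table, index, and attribute names are made up, and pagination via LastEvaluatedKey is omitted):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

public class Last24Hours {
    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        long now = System.currentTimeMillis();
        long dayAgo = now - 24L * 60 * 60 * 1000;

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":one", new AttributeValue().withN("1"));   // every item has foo = 1
        values.put(":from", new AttributeValue().withN(Long.toString(dayAgo)));
        values.put(":to", new AttributeValue().withN(Long.toString(now)));

        // Query the GSI: constant partition key "foo", numeric range key "timestamp".
        // "timestamp" is a DynamoDB reserved word, hence the #ts placeholder.
        QueryRequest query = new QueryRequest()
                .withTableName("EventIndex")
                .withIndexName("foo-timestamp-index")
                .withKeyConditionExpression("foo = :one AND #ts BETWEEN :from AND :to")
                .withExpressionAttributeNames(Collections.singletonMap("#ts", "timestamp"))
                .withExpressionAttributeValues(values);

        QueryResult result = dynamo.query(query);
        result.getItems().forEach(item -> System.out.println(item.get("S3-Key").getS()));
    }
}
```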

Looks a bit dirty, huh? Then you can query the items for the last 24 hours with partition key 1 (all items have 1) and a range condition on that timestamp key. Now, the problems:

  1. A GSI where all items share the same partition key? Performance will suck if the data grows large
  2. Costs more with a GSI

You should think about the costs as well. Think about your data ingestion rate. Putting 1000 objects per second into a bucket would cost you about $600 per month for the index writes, and $600 more with the GSI. Just because of that one query need (last 24 hrs), you have to spend $600 more.

I'm running into the same problems while designing this metadata index. DynamoDB just doesn't look right; this is what you always get when you try to use DynamoDB the way you would use an RDBMS, and I have a few querying needs like yours. I thought about Elasticsearch and the S3 listing river plugin, but that doesn't look good either, since I would have to manage ES clusters and storage. What about CloudSearch? Looking at its limits, CloudSearch doesn't feel right either.

My requirements:

  1. be able to access the most recent object with a given prefix
  2. be able to access objects within a specific time range
  3. get maximum performance out of S3 by using hashed strings in the key space, for AWS EMR, Athena or Redshift Spectrum

I am all lost here. I even thought about the S3 versioning feature, since with it I can get the most recent object naturally. Nothing seems quite right, and the AWS documents and blog articles are full of confusion.

This is where I'm stuck for the whole week :(

People at AWS just love drawing diagrams. When they introduce some new architecture scheme or concept, they just put a bunch of AWS product icons there and say it's beautifully integrated.

gini09
  • I even thought about putting the epoch timestamp in the object keys in binary number format, e.g. 4238429332 would be like "111011010101010101010101"; then you can LIST with a certain prefix, which gives you a specific time range. Guess what? S3 LIST requests are much more expensive than DynamoDB read provisioning. If you can somehow use the whole result, up to the limit of 1000 objects, it could make sense, but that wasn't my case. – gini09 May 11 '17 at 07:12
  • It looks like AWS is building a new feature for S3. There is a team called "S3 indexing team" and they're hiring... https://www.amazon.jobs/en/jobs/468608 – gini09 May 11 '17 at 08:37