
I am struggling to understand how DynamoDB / Elasticsearch should be used to support AWS data lake efforts (metadata / catalogs). As I understand it, you would log the individual S3 locations of your source zip archives in DynamoDB, and store any additional metadata / attributes you would like to search by in Elasticsearch. If that is correct, how would you use the two together? I have tried to find more detailed information about how to properly pair them, but have been unsuccessful. Any information / documentation others have would be great; there is a good chance I am overlooking some obvious examples / documentation.

What I am imagining is something like the following:

  • A user could search for metadata / attributes in Elasticsearch, which would point to the high-level S3 buckets / partitions that match.
  • DynamoDB would then be queried against part of the key (partition / bucket) returned by the Elasticsearch result.
  • That query would most likely return many individual objects / keys that could then be processed, extracted, etc. (a rough sketch of this flow follows below).
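
Here is a minimal sketch of that two-step lookup, assuming elasticsearch-py 7.x and boto3. The index name (`datalake-catalog`), table name (`DataCatalog`), key names (`partition`, `object_key`), and attributes (`source_system`, `s3_partition`) are all hypothetical placeholders, not anything from AWS's reference design:

```python
import boto3
from boto3.dynamodb.conditions import Key
from elasticsearch import Elasticsearch  # elasticsearch-py 7.x

es = Elasticsearch(["https://my-es-endpoint:443"])          # hypothetical ES endpoint
table = boto3.resource("dynamodb").Table("DataCatalog")     # hypothetical table

# Step 1: search ES for attributes; each hit records the S3 partition it lives in.
hits = es.search(
    index="datalake-catalog",
    body={"query": {"match": {"source_system": "billing"}}},
)["hits"]["hits"]

# Step 2: for each matching partition, pull the individual object keys from DynamoDB.
for hit in hits:
    partition = hit["_source"]["s3_partition"]  # e.g. "raw/billing/2017/10/"
    resp = table.query(KeyConditionExpression=Key("partition").eq(partition))
    for item in resp["Items"]:
        print(item["object_key"])  # hand each key off for extraction / processing
```

The point of the split is that Elasticsearch answers the fuzzy "which partitions are relevant?" question, while DynamoDB answers the exact "which objects live under this partition?" question cheaply via its partition key.
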
  • Yes, that sounds right. Use each service for what it does best: S3 for reliable storage, DynamoDB for fast lookups on partition keys, and Elasticsearch for fast, accurate searchability. You would just share a unique ID (UUID) across all three services to link the records together. – John Veldboom Oct 09 '17 at 18:53
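
To illustrate the linking idea from that comment, here is a sketch of registering one record across all three services with a shared UUID. Again, the bucket, table, index, and attribute names are hypothetical, and the Elasticsearch calls assume elasticsearch-py 7.x:

```python
import uuid
import boto3
from elasticsearch import Elasticsearch  # elasticsearch-py 7.x

record_id = str(uuid.uuid4())               # one UUID ties all three services together
partition = "raw/billing/2017/10/"          # hypothetical partition scheme
object_key = f"{partition}{record_id}.zip"

# S3: reliable storage for the archive itself.
boto3.client("s3").put_object(Bucket="my-data-lake", Key=object_key, Body=b"...")

# DynamoDB: fast key lookups, partitioned by the S3 prefix.
boto3.resource("dynamodb").Table("DataCatalog").put_item(
    Item={"partition": partition, "object_key": object_key, "record_id": record_id}
)

# Elasticsearch: searchable metadata / attributes, using the same UUID as the document id.
Elasticsearch(["https://my-es-endpoint:443"]).index(
    index="datalake-catalog",
    id=record_id,
    body={"s3_partition": partition, "object_key": object_key, "source_system": "billing"},
)
```
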

1 Answer


I spoke to one of our AWS reps, who referred me to this article: AWS Data Lake. It was a great starting point and answered some of my questions about the use of the components and the overall approach, which had previously been unclear to me.

Highlights:

  • It is a blueprint for implementing a data lake; combining S3 / DynamoDB / ES is common.
  • There are many variations on the implementation: substituting an RDS for ES / DynamoDB, using just ES, etc.
  • We will most likely start with an RDS to work out the process, then move to DynamoDB / ES.