
We are trying to implement a clickstream flow for our e-commerce site on AWS. The clickstream will capture all actions performed by 'anonymous' users. Anonymous users are tracked via a UUID, generated during their first visit, that is stored in a cookie. We used the AWS example here to come up with a solution architecture like the diagram below:

[diagram: proposed clickstream architecture]

Now 2 questions :

  1. Different pages in the e-commerce site produce different clickstream data. For example, on the item view page we would like to also send item-related info such as the itemId, and on the checkout page we would like to tie some order-related info to the clickstream data. Should we have separate Firehose delivery streams for different pages to support custom clickstream data, or should we send a generic clickstream record (with possible null values for some attributes; see the sketch after these questions) to a single Firehose delivery stream?

  2. At some point our anonymous users become identified (e.g. they log in, so we know their User_ID), so we would like to link the UUID and the User_ID to be able to build a customer 360 view. Should we consider a separate stream flow plus a separate S3 bucket for tracking the UUID-to-User_ID mappings? Should we then use Athena for showing aggregated customer 360 reports? Should we aggregate the data and create a customer dimension in Redshift? What would be a good solution for this?
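
For question 1, here is a minimal sketch of what a single generic event record could look like (all field names here are assumptions, not a prescribed schema); page-specific attributes are simply null where they do not apply:

    import json
    import uuid
    from datetime import datetime, timezone

    # Hypothetical generic clickstream event; itemId/orderId are only
    # populated on the pages where they apply, otherwise left as None.
    event = {
        "anonymousId": str(uuid.uuid4()),  # UUID from the visitor cookie
        "eventType": "item_view",          # e.g. page_view, item_view, checkout
        "page": "/items/12345",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "itemId": "12345",                 # None on non-item pages
        "orderId": None,                   # only set on checkout pages
    }
    record = json.dumps(event) + "\n"      # newline-delimited JSON for Firehose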

Regards, Lina

[Update]: Is the following diagram an acceptable solution for the question?

[diagram: updated proposed architecture]


1 Answer


You should make the decision based on how you intend to access the data. Given the rate at which clickstream data grows, if you want to generate meaningful insights from it with reasonable response times and cost, you will want to make use of data partitioning. Read more about this here.
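
As a concrete illustration (the stream, bucket, and role names below are placeholders, not from the question), Kinesis Data Firehose can already write objects under date-based S3 prefixes, which partition-aware engines such as Athena or Redshift Spectrum can later exploit:

    import boto3

    firehose = boto3.client("firehose")

    # Hypothetical delivery stream writing date-partitioned objects to S3;
    # the !{timestamp:...} expressions are evaluated by Firehose at delivery time.
    firehose.create_delivery_stream(
        DeliveryStreamName="clickstream-events",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            "BucketARN": "arn:aws:s3:::my-clickstream-bucket",
            "Prefix": "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
            "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        },
    )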

To be able to do that reliably, you will have to use multiple Kinesis streams.

The only reason you would choose not to use multiple streams is cost. But given that this is a clickstream application, and that the website has active users, the volume of incoming events should be enough to use the shards of each stream effectively.

Disclaimer (personal opinion): I would suggest that you move this to Kinesis Data Firehose, so that you have the flexibility to start loading the data into Redshift with minimal process changes at a later stage, while at the same time backing the data up to S3 for cold storage. Given the volume, Athena might not be a good choice for performing analytical queries on the data. You can look at using Redshift external tables, wherein the data still lies on S3. As for the cost of the Redshift cluster itself, you can now pause it. Read the announcement here.
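
For instance, pausing and resuming can be scripted with boto3 (the cluster identifier below is a placeholder), so you only pay for compute while the cluster is actually in use:

    import boto3

    redshift = boto3.client("redshift")

    # Pause the cluster outside reporting hours; storage is still billed,
    # but compute charges stop while the cluster is paused.
    redshift.pause_cluster(ClusterIdentifier="clickstream-cluster")

    # ...and resume it before the next reporting window.
    redshift.resume_cluster(ClusterIdentifier="clickstream-cluster")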

To address the updated architecture diagram that you have added: you can completely skip Glue. Kinesis Data Firehose can load the data directly into S3, and you can define external tables with Redshift Spectrum.
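
A minimal sketch of those external tables (all schema, table, role, and bucket names are assumptions), written as SQL strings to run against the cluster with any SQL client or driver. Note that the CREATE EXTERNAL SCHEMA below still registers metadata in the shared Glue/Athena Data Catalog; it is the separate Glue ETL job that can be skipped:

    # Hypothetical Spectrum DDL; run against the Redshift cluster.
    CREATE_EXTERNAL_SCHEMA = """
    CREATE EXTERNAL SCHEMA clickstream_ext
    FROM DATA CATALOG DATABASE 'clickstream'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """

    # Partition columns mirror the S3 prefix layout; each new day's prefix
    # must then be registered with ALTER TABLE ... ADD PARTITION.
    CREATE_EXTERNAL_TABLE = """
    CREATE EXTERNAL TABLE clickstream_ext.events (
        anonymous_id VARCHAR(36),
        event_type   VARCHAR(64),
        item_id      VARCHAR(64),
        order_id     VARCHAR(64),
        event_time   TIMESTAMP
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    LOCATION 's3://my-clickstream-bucket/events/';
    """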

The general approach is to load the data into Redshift as well as back it up to S3. On Redshift you can then periodically delete old data (say, anything more than a year old). This balances cost against performance, as queries over the data that lives in Redshift will be faster.
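
A sketch of such a retention job (the table and column names are assumptions); the old events remain queryable through the Spectrum external table over S3 while the local Redshift table stays lean:

    # Hypothetical retention queries, run on a schedule (e.g. a cron job).
    DELETE_OLD_EVENTS = """
    DELETE FROM analytics.events
    WHERE event_time < DATEADD(year, -1, GETDATE());
    """

    # Reclaim the space freed by the deletes and re-sort the remaining rows.
    VACUUM_EVENTS = "VACUUM DELETE ONLY analytics.events;"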

As for transformations, you can directly make use of Lambda functions with Kinesis Data Firehose. Read more about this here.
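
A minimal sketch of such a transformation function (the enrichment logic itself is a placeholder); Firehose hands the Lambda a batch of base64-encoded records and expects each one back with a recordId, a result status, and the re-encoded data:

    import base64
    import json

    def lambda_handler(event, context):
        """Firehose data-transformation handler: decode, enrich, re-encode."""
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))

            # Placeholder transformation: normalize the event type casing.
            payload["eventType"] = payload.get("eventType", "unknown").lower()

            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # or "Dropped" / "ProcessingFailed"
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}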

Edit 1: Added the opinion around using Redshift and why it will be useful and cost effective

Edit 2: Added details around simplifying the new proposed architecture.

  • Thanks @Mayank. But we are not using Kinesis Data Streams; we are not interested in real-time analytics at the moment. We will be using Kinesis Firehose, so there will be no shards there. – Lina Jul 14 '20 at 09:57
  • The same rationale applies there as well: it is the same task of finding a sweet spot between cost and performance. Also, I would suggest that you give Firehose a try; I've updated my answer with the reasons why. – Mayank Raj Jul 14 '20 at 10:04
  • Thanks @Mayank... I just updated the question. So is the second diagram an acceptable solution (let's say we have the S3 partitioning and the file format in Glue will be Parquet)? Glue will be used by Redshift Spectrum to create external tables, and then we will be able to join those tables and get a final customer 360 view table. Will that work? – Lina Jul 14 '20 at 12:15
  • @Lina, I've updated the answer with how you can further optimize the architecture. If the answer did indeed address your question, do you mind accepting it? That way, others who stumble upon this can easily find the solution. Cheers. – Mayank Raj Jul 14 '20 at 12:32