
I'm an analyst preparing Tableau reports with analysis for other teams. I would like to take some of the workload off my shoulders by creating a data source so optimized that users can get the data they need and do the analysis by themselves.

Current situation:

We use Amazon Redshift. We have tables with raw data coming directly from the source systems, and we also have some transformed tables for easier work. All in all, it's tens and tens of tables. We use Tableau Desktop and Tableau Server.

Desired situation:

I would like to retain access to the raw data so I can trace any potential issues back to the original source. From the raw data, I would like to create transformed tables that users can run their queries against (a two-layer system). The tables should contain all the data a user might need, yet be simple enough for a beginner-level SQL user.
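To illustrate the two-layer idea, here is a minimal sketch in Redshift SQL; the schema and table names (raw_data, reporting, orders, customers) are hypothetical placeholders, not our actual setup:

```sql
-- Layer 1: raw_data holds untouched source tables (hypothetical names).
-- Layer 2: reporting holds transformed tables built from the raw layer.
CREATE SCHEMA IF NOT EXISTS reporting;

CREATE TABLE reporting.orders_enriched AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_name,
    c.customer_segment
FROM raw_data.orders o
JOIN raw_data.customers c
    ON c.customer_id = o.customer_id;
```

Because the raw layer stays untouched, any number in the reporting layer can be traced back to its source.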

I see two ways of approaching this:

  1. A small number of very large tables with all the data. If there are just a couple of tables containing the maximum amount of data, a user can query a single table and select the columns they need, or, if necessary, join one or two more tables to it.
  2. Many small and very specialized tables. Users will have to do multiple joins to get the data they need, but all the tables will be very simple, so it will not be difficult. (Both options are sketched in SQL below.)
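To make the trade-off concrete, here is how the same hypothetical question might look under each option; the table and column names are made up for illustration:

```sql
-- Option 1: one wide table; a beginner selects only the columns they need.
SELECT order_date, customer_segment, SUM(amount) AS revenue
FROM reporting.orders_wide
GROUP BY order_date, customer_segment;

-- Option 2: small specialized tables; the same question requires a join.
SELECT o.order_date, c.customer_segment, SUM(o.amount) AS revenue
FROM reporting.orders o
JOIN reporting.customers c
    ON c.customer_id = o.customer_id
GROUP BY o.order_date, c.customer_segment;
```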

Also, access permissions to the data need to be considered.
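In Redshift, one way to handle this is at the schema level; a minimal sketch, assuming a read-only group for report users and the raw/reporting split above:

```sql
-- Report users get read access to the reporting layer only;
-- the raw layer stays restricted to analysts.
CREATE GROUP report_users;

GRANT USAGE ON SCHEMA reporting TO GROUP report_users;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO GROUP report_users;

-- Make future tables in the schema readable as well.
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting
    GRANT SELECT ON TABLES TO GROUP report_users;
```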

What do you think is a good approach to solving my issue? Is it either of the two solutions mentioned above? Do you have another solution? What would you recommend?

Jan Horčička
  • For this I would set up a simple data warehouse backed by a data lake to hold the raw data. I would use Airflow to orchestrate the jobs to make this happen. I would consider Redshift Spectrum and hold the raw data on S3. – Jon Scott Mar 29 '18 at 15:01

1 Answer


We had this problem and sorted it out with AWS Athena. You pay only when data is scanned and used; otherwise you pay nothing and no data is touched.

With AWS Athena you can create any set of tables with different attributes, and the role permissions are easy to maintain.
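For example, a minimal sketch of an Athena table defined over data that already sits on S3; the bucket path, column list, and Parquet format are assumptions for illustration:

```sql
-- Athena external table; nothing is loaded, the data stays on S3.
CREATE EXTERNAL TABLE reporting.orders (
    order_id    bigint,
    order_date  date,
    amount      double,
    customer_id bigint
)
STORED AS PARQUET
LOCATION 's3://your-bucket/reporting/orders/';
```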

The last part to cover: Tableau has a direct connector for Athena, so there is no need for any intermediate storage.

Also, any time you no longer want a table, just delete it and remove it from the roles; the rest is taken care of automatically.

As an additional note: we tried Redshift Spectrum on JSON data, and it does not work with nested JSON yet, so all your attributes should be only one level deep.

Hope it helps.

EDIT1:

Redshift is a columnar database, so there is little difference between small tables and big tables as long as you can avoid joins. Even if a table is big, query speed depends on the columns involved in the query; a column that is not required by the query is never touched when you query the data.

I prefer to have all related data in one bigger table, so there is no need to reproduce relations or joins across tables.
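A minimal sketch of what such a bigger table could look like in Redshift; the names are hypothetical, and the sort key is chosen on the assumption that report users filter by date most often:

```sql
CREATE TABLE reporting.orders_wide (
    order_id         bigint,
    order_date       date,
    amount           double precision,
    customer_name    varchar(256),
    customer_segment varchar(64)
    -- ...plus every other column a report user might need
)
DISTSTYLE EVEN
SORTKEY (order_date);
```

Because Redshift stores each column separately, the unused columns cost nothing at query time; the trade-off is the customer attributes repeated on every order row, which is the duplication concern mentioned next.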

You also need to ensure there is not too much duplication of data when you store everything in a bigger table.

More about database normalization:

MySQL: multiple tables or one table with many columns?

Kannaiyan
  • Spectrum is similar to Athena; which is best depends on your usage, and you get more flexibility using Spectrum. Also see https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-redshift-spectrum-now-supports-scalar-json-and-ion-data-types/ – Jon Scott Mar 30 '18 at 09:08
  • We have the data in Amazon Redshift and we want to keep it that way. The question was more about how to optimally structure the tables so they are as user-friendly as possible for non-tech users, who can then use them as a self-service. – Jan Horčička Apr 12 '18 at 14:28
  • @JanHorčička Sorry, I did not address that part. Updated the answer: on a columnar database, bigger tables are preferable. – Kannaiyan Apr 12 '18 at 15:19