6

I'm new to Redshift and having some clarification on how Redshift operates:

Does Amazon Redshift has their own backend storage platform or it depends on S3 to store the data as objects and Redshift is used only for querying, processing and transforming and has temporary storage to pick up the specific slice from S3 and process it?

In the sense, does redshift has its own backend cloud space like oracle or Microsoft SQL having their own physical server in which data is stored?

Because, if I'm migrating from a conventional RDBMS system to Redshift due to increased volume, If I opt for Redshift alone would do or should I opt for combination of Redshift and S3.

This question seems to be basic, but I'm unable to find answer in Amazon websites or any of the blogs related to Redshift.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Durgaprasad
  • 159
  • 2
  • 9

2 Answers2

11

Yes, Amazon Redshift uses its own storage.

The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".

Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). This is what gives Redshift its speed. In fact, it is the dedicated storage, and the way that data is stored, that gives Redshift its amazing speed.

The trade-off, however, means that while Redshift is amazing for queries large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc) and then run complex queries that combine all that data.

Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.

The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and uses caching to run fast queries. The benefit is that it separates storage from compute.

Quick summary:

  • If you need a database for an application, use Amazon RDS
  • If you are building a data warehouse, use Amazon Redshift
  • If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
  • 1
    John I really like your full answer here, and would likely refer to it for other answers in future. For Athena/Spectrum, generally you really should only be using that for transactional and immutable data , and not for data that is often updated. If you agree maybe you could update your answer to include something about that. Also maybe you should consider AWS data lake recommendations which in essence is a hybrid. – Jon Scott May 08 '20 at 22:34
1

looking at your question, you may benefit from professional help with your architecture.

However to get you started, Redshift::

  • has its own data storage, no link to s3.
  • Amazon Redshift Spectrum allows you to also query data held in s3 (similar to AWS Athena)
  • is not a good alternative as a back-end database to replace a traditional RDBMS as transactions are very slow.
  • is a great data warehouse tool, just use it for that!
John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Jon Scott
  • 4,144
  • 17
  • 29