“We want to centralize data across our organization using a scalable solution with low maintenance requirements.”
This is the goal of almost every data warehouse project, and with the power of the modern public cloud it is increasingly realistic. IT organizations know this and have been prioritizing database cloud migration projects over the last few years.
While increased scalability and lower maintenance costs drove the initial push to cloud data warehousing, the pace of migration is accelerating now because of the dramatic expansion of the core functionality of a traditional data warehouse. Cloud data warehouses bring enhanced performance for queries and data storage and enable easy data sharing across departments, regions, clients, or even the public. They also offer resource auto-scaling in seconds, cloning, replication, built-in auto-ingestion, and more.
In this article, intended for a technical audience, we’ll discuss each of these benefits in the context of Amazon Redshift, highlighting some of its best features and how your organization would benefit from including it in its data platform.
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service within the AWS ecosystem that lets you centralize your organization’s data into a single repository in the cloud.
Redshift started out as a PostgreSQL fork, but AWS rewrote the storage engine to be columnar, turned it into an OLAP relational data store with analytics features such as window functions, and added a massively parallel processing (MPP) architecture for near-limitless scaling.
Redshift is fully integrated with other AWS services, such as VPC, KMS, and IAM for security, S3 for data lake integration and backups, EC2 for the underlying cluster nodes, and CloudWatch for monitoring.
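As a small illustration of that integration, the hedged boto3 sketch below pulls a CPU utilization metric for a Redshift cluster from CloudWatch. The region and the cluster identifier `analytics-cluster` are placeholders; adapt them to your own environment.

```python
# Sketch: read a Redshift cluster metric from CloudWatch with boto3.
# "analytics-cluster" is a placeholder cluster identifier.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "analytics-cluster"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
)

# Print the last hour of average CPU utilization, oldest first.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "% CPU")
```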
Redshift stands out because it can act as both a data warehouse and a query engine over your data lake; AWS refers to this combined pattern as a lake house architecture.
Redshift allows you to extend your queries to your Amazon S3 data lake without moving or transforming data. With Redshift Spectrum, you can query open file formats you already use, such as Avro, CSV, Grok, JSON, ORC, Parquet, and more, directly in S3. This gives you the flexibility to store highly structured, frequently accessed data in Redshift, keep exabytes of structured and unstructured data in S3, and query seamlessly across both to provide unique insights that you would not be able to obtain by querying independent datasets.
Redshift Spectrum is a powerful feature of Amazon Redshift that allows users to query data in an S3 data lake as if it were any other table stored locally in your cloud data warehouse cluster. An S3 data lake has the potential to store exabytes of data, and with Spectrum, Amazon Redshift can query it all.
The external data (your data lake on S3, or even an OLTP database on Aurora or RDS via federated queries) is queried in place, which means no data has to be moved into Redshift. This lets you keep your data warehouse lean and enables the lake house pattern out of the box.
Redshift Spectrum allows SQL and BI apps to seamlessly reference external tables in queries as they do any other table within the Redshift cluster. Spectrum also supports complex joins, nested queries, and window functions on the external tables, which is very useful for advanced analysis.
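As a rough sketch of what that looks like in practice, the Python snippet below uses psycopg2 against Redshift’s PostgreSQL-compatible endpoint to register an S3 data lake as an external schema backed by the AWS Glue Data Catalog, then joins an external table with a local one. The connection details, IAM role ARN, Glue database, and table names are all illustrative, and it assumes the external table `page_views` is already defined in the catalog.

```python
# Sketch: lake house query pattern with Redshift Spectrum via psycopg2.
# All identifiers, credentials, and ARNs below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Register the S3 data lake as an external schema backed by the Glue Data Catalog.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG
        DATABASE 'clickstream_lake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    # Join an external table (Parquet files in S3) with a local Redshift table.
    cur.execute("""
        SELECT u.region, COUNT(*) AS page_views
        FROM spectrum.page_views pv          -- lives in S3, queried in place
        JOIN public.users u ON u.user_id = pv.user_id
        WHERE pv.event_date >= '2023-01-01'
        GROUP BY u.region
        ORDER BY page_views DESC;
    """)
    for region, page_views in cur.fetchall():
        print(region, page_views)

conn.close()
```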
Concurrency Scaling is one of the features that allows Redshift to scale compute capacity on demand, independently of storage, for consistently fast query performance.
With Concurrency Scaling, whenever your Redshift cluster experiences a temporary burst of user activity, Redshift automatically adds transient clusters to handle the increased concurrent workload. Queries are routed to these scaling clusters, which are provisioned in seconds and begin processing work immediately.
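Concurrency Scaling is enabled per WLM queue. The boto3 sketch below shows one way this might be configured by updating the cluster parameter group’s `wlm_json_configuration`; the parameter group name, queue layout, and scaling cap are placeholders, and the exact JSON shape should be checked against the Redshift WLM documentation for your setup.

```python
# Sketch: turn on Concurrency Scaling for a WLM queue by updating the
# cluster parameter group. Names and values below are placeholders.
import json

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# A simple manual-WLM queue with concurrency scaling routed automatically.
wlm_config = [
    {
        "query_group": [],
        "user_group": [],
        "query_concurrency": 5,
        "concurrency_scaling": "auto",   # send bursts to transient scaling clusters
    },
    {"short_query_queue": True},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-wlm",
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        },
        # Cap how many transient scaling clusters can be added at once.
        {"ParameterName": "max_concurrency_scaling_clusters", "ParameterValue": "4"},
    ],
)
```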
Redshift supports data manipulation language (DML) commands such as INSERT, UPDATE, and DELETE, but it’s highly recommended to use the COPY command to load data into your Redshift cluster in order to take advantage of Redshift’s parallel processing capabilities for better performance.
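A minimal sketch of such a bulk load might look like the following; the table, S3 path, and IAM role ARN are placeholders, and COPY spreads the work across the cluster’s slices in parallel.

```python
# Sketch: bulk-load Parquet files from S3 with COPY instead of row-by-row INSERTs.
# Connection details, table, bucket, and role ARN are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        COPY public.sales
        FROM 's3://my-data-bucket/sales/2023/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """)

conn.close()
```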
With federated queries against your operational databases, you can also incorporate live data into your business intelligence (BI) and reporting applications. It becomes easier than ever to ingest data into the warehouse: query operational databases directly, apply transformations on the fly, and load data into target tables without complex ETL pipelines.
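The following is a hedged sketch of that flow, again over the PostgreSQL-compatible endpoint: it exposes an operational Aurora/RDS PostgreSQL database as an external schema via federated queries, then loads a lightly transformed aggregate straight into a warehouse table. The endpoints, secret ARN, IAM role, and table names are illustrative.

```python
# Sketch: federated query from Redshift to an operational PostgreSQL database,
# followed by a light ELT load. All identifiers and ARNs are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Expose the live operational database as an external schema.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS orders_live
        FROM POSTGRES
        DATABASE 'orders' SCHEMA 'public'
        URI 'orders-db.abc123.us-east-1.rds.amazonaws.com' PORT 5432
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
        SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-db-creds';
    """)

    # Transform on the fly and load the result into a warehouse table.
    cur.execute("""
        INSERT INTO public.daily_order_totals (order_date, total_amount)
        SELECT order_date::date, SUM(amount)
        FROM orders_live.orders
        WHERE order_date >= CURRENT_DATE - 7
        GROUP BY order_date::date;
    """)

conn.close()
```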
Thanks to Redshift’s architecture, its feature set, and the evolution of the cloud, it’s now possible to implement a scalable lake house solution within weeks and enjoy all the benefits that cloud services can provide, all within the AWS ecosystem.
With such a fast time to market, Redshift can deliver value very quickly. Adding a cloud data warehouse to your data platform lets you and your warehouse users store and analyze data from all of your organization’s data sources more effectively and more quickly.