Speed up question efficiency with Apache Iceberg statistics on the AWS Glue Information Catalog

0
1


At the moment, we’re happy to announce a brand new functionality for the AWS Glue Information Catalog: producing column-level aggregation statistics for Apache Iceberg tables to speed up queries. These statistics are utilized by cost-based optimizer (CBO) in Amazon Redshift Spectrum, leading to improved question efficiency and potential price financial savings.

Apache Iceberg is an open desk format that gives the aptitude of ACID transactions in your information lakes. It’s designed to course of giant analytics datasets and is environment friendly for even small row-level operations. It additionally permits helpful options resembling time-travel, schema evolution, hidden partitioning, and extra.

AWS has invested in service integration with Iceberg to allow Iceberg workloads primarily based on buyer suggestions. One instance is the AWS Glue Information Catalog. The Information Catalog is a centralized repository that shops metadata about your group’s datasets, making the info seen, searchable, and queryable for customers. The Information Catalog helps Iceberg tables and tracks the desk’s present metadata. It additionally permits automated compaction of particular person small information produced by every transactional write on tables into a number of giant information for sooner learn and scan operations.

In 2023, the Information Catalog introduced help for column-level statistics for non-Iceberg tables. That function collects desk statistics utilized by the question engine’s CBO. Now, the Information Catalog expands this help to Iceberg tables. The Iceberg desk’s column statistics that the Information Catalog generates are primarily based on Puffin Spec and saved on Amazon Easy Storage Service (Amazon S3) with different desk information. This fashion, numerous engines supporting Iceberg can make the most of and replace them.

This publish demonstrates how column-level statistics for Iceberg tables work with Redshift Spectrum. Moreover, we showcase the efficiency good thing about the Iceberg column statistics with the TPC-DS dataset.

How Iceberg desk’s column statistics works

AWS Glue Information Catalog generates desk column statistics utilizing the Theta Sketch algorithm on Apache DataSketches to estimate the variety of distinct values (NDV) and shops them in Puffin file.

For SQL planners, NDV is a crucial statistic to optimize question planning. There are a number of eventualities the place NDV statistics can doubtlessly optimize question efficiency. For instance, when becoming a member of two tables on a column, the optimizer can use the NDV to estimate the selectivity of the be a part of. If one desk has a low NDV for the be a part of column in comparison with the opposite desk, the optimizer might select to make use of a broadcast be a part of as an alternative of a shuffle be a part of, lowering information motion and enhancing question efficiency. Furthermore, when there are greater than two tables to be joined, the optimizer can estimate the output measurement of every be a part of and plan the environment friendly be a part of order. Moreover, NDV can be utilized for numerous optimizations resembling group by, distinct, and depend question.

Nonetheless, calculating NDV repeatedly with 100% accuracy requires O(N) area complexity. As an alternative, Theta Sketch is an environment friendly algorithm that lets you estimate the NDV in a dataset while not having to retailer all of the distinct values on reminiscence and storage. The important thing concept behind Theta Sketch is to hash the info into a variety between 0–1, after which choose solely a small portion of the hashed values primarily based on a threshold (denoted as θ). By analyzing this small subset of knowledge, the Theta Sketch algorithm can present an correct estimate of the NDV within the unique dataset.

Iceberg’s Puffin file is designed to retailer info resembling indexes and statistics as a blob kind. One of many consultant blob varieties that may be saved is apache-datasketches-theta-v1, which is serialized values for estimating the NDV utilizing the Theta Sketch algorithm. Puffin information are linked to a snapshot-id on Iceberg’s metadata and are utilized by the question engine’s CBO to optimize question plans.

Leverage Iceberg column statistics by Amazon Redshift

To exhibit the efficiency good thing about this functionality, we make use of the industry-standard TPC-DS 3 TB dataset. We examine the question efficiency with and with out Iceberg column statistics for the tables by operating queries in Redshift Spectrum. We have now included the queries used on this publish, and we suggest attempting your personal queries by following the workflow.

The next is the general steps:

  1. Run AWS Glue Job that extracts TPS-DS dataset from Public Amazon S3 bucket and saves them as an Iceberg desk in your S3 bucket. AWS Glue Information Catalog shops these tables’ metadata location. Question these tables utilizing Amazon Redshift Spectrum.
  2. Generate column statistics: Make use of the improved capabilities of AWS Glue Information Catalog to generate column statistics for every tables. It generates puffin information storing Theta Sketch.
  3. Question with Amazon Redshift Spectrum: Consider the efficiency good thing about column statistics on question efficiency by using Amazon Redshift Spectrum to run queries on the dataset.

The next diagram illustrates the structure.

To do this new functionality, we full the next steps:

  1. Arrange sources with AWS CloudFormation.
  2. Run an AWS Glue job to create Iceberg tables for the 3TB TPC-DS dataset in your S3 bucket. The Information Catalog shops these tables’ metadata location.
  3. Run queries on Redshift Spectrum and word the question length.
  4. Generate Iceberg column statistics for Information Catalog tables.
  5. Run queries on Redshift Spectrum and examine the question length with the earlier run.
  6. Optionally, schedule AWS Glue column statistics jobs utilizing AWS Lambda and an Amazon EventBridge

Arrange sources with AWS CloudFormation

This publish features a CloudFormation template for a fast setup. You may overview and customise it to fit your wants. Word that this CloudFormation template requires a area with at the very least 3 Availability Zones. The template generates the next sources:

  • A digital personal cloud (VPC), public subnet, personal subnets, and route tables
  • An Amazon Redshift Serverless workgroup and namespace
  • An S3 bucket to retailer the TPC-DS dataset, column statistics, job scripts, and so forth
  • Information Catalog databases
  • An AWS Glue job to extract the TPS-DS dataset from the general public S3 bucket and save the info as an Iceberg desk in your S3 bucket
  • AWS Identification and Entry Administration (AWS IAM) roles and insurance policies
  • A Lambda operate and EventBridge schedule to run the AWS Glue column statistics on a schedule

To launch the CloudFormation stack, full the next steps:

  1. Register to the AWS CloudFormation console.
  2. Select Launch Stack.
  3. Select Subsequent.
  4. Go away the parameters as default or make acceptable modifications primarily based in your necessities, then select Subsequent.
  5. Evaluation the small print on the ultimate web page and choose I acknowledge that AWS CloudFormation would possibly create IAM sources.
  6. Select Create.

This stack can take round 10 minutes to finish, after which you’ll view the deployed stack on the AWS CloudFormation console.

Run an AWS Glue job to create Iceberg tables for the 3TB TPC-DS dataset

When the CloudFormation stack creation is full, run the AWS Glue job to create Iceberg tables for the TPC-DS dataset. This AWS Glue job extracts the TPC-DS dataset from the general public S3 bucket and transforms the info into Iceberg tables. These tables are loaded into your S3 bucket and registered to the Information Catalog.

To run the AWS Glue job, full the next steps:

  1. On the AWS Glue console, select ETL jobs within the navigation pane.
  2. Select InitialDataLoadJob-<your-stack-name>.
  3. Select Run.

This AWS Glue job can take round half-hour to finish. The method is full when the job processing standing reveals as Succeeded.

The AWS Glue job creates tables storing the TPC-DS dataset in two an identical databases: tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats can have no generated statistics, and we use them as reference. We generate statistics on tables in tpcdsdbwithstats. Affirm the creation of these two databases and underlying tables on the AWS Glue console. Presently, these databases maintain the identical information and there aren’t any statistics generated on the tables.

Run queries on Redshift Spectrum with out statistics

Within the earlier steps, you arrange a Redshift Serverless workgroup with the given RPU (128 by default), ready the TPC-DS 3TB dataset in your S3 bucket, and created Iceberg tables (which at the moment don’t have statistics).

To run your question in Amazon Redshift, full the next steps:

  1. Obtain the Amazon Redshift queries.
  2. Within the Redshift question editor v2, run the queries listed within the Redshift Question for tables with out column statistics part within the downloaded file redshift-tpcds-sample.sql.
  3. Word the question runtime of every question.

Generate Iceberg column statistics

To generate statistics on the Information Catalog tables, full the next steps:

  1. On the AWS Glue console, select Databases beneath Information Catalog within the navigation pane.
  2. Select the tpcdsdbwithstats database to view all accessible tables.
  3. Choose any of those tables (for instance, call_center).
  4. Go to Column statistics – new and select Generate statistics.
  5. Preserve the default choices:
    1. For Select columns, choose Desk (All columns).
    2. For Row sampling choices, choose All rows.
    3. For IAM function, select AWSGluestats-blog-<your-stack-name>.
  6. Select Generate statistics.

You’ll have the ability to see standing of the statistics technology run as proven within the following screenshot.

After you generate the Iceberg desk column statistics, it is best to have the ability to see detailed column statistics for that desk.

Following the statistics technology, you will see an <id>.stat file within the AWS Glue desk’s underlying information location in Amazon S3. This file is a Puffin file that shops the Theta Sketch information construction. Question engines can use this Theta Sketch algorithm to effectively estimate the NDV when working on the desk, which helps optimize question efficiency.

Reiterate the earlier steps to generate statistics for all tables, resembling catalog_sales, catalog_returns, warehouse, merchandise, date_dim, store_sales, buyer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you possibly can manually run the Lambda operate that instructs AWS Glue to generate column statistics for all tables. We talk about the small print of this operate later on this publish.

After you generate statistics for all tables, you possibly can assess the question efficiency for every question.

Run queries on Redshift Spectrum with statistics

Within the earlier steps, you arrange a Redshift Serverless workgroup with the given RPU (128 by default), ready the TPC-DS 3TB dataset in your S3 bucket, and created Iceberg tables with column statistics.

To run the offered question utilizing Redshift Spectrum on the statistics tables, full the next steps:

  1. Within the Redshift question editor v2, run the queries listed in Redshift Question for tables with column statistics part within the downloaded file redshift-tpcds-sample.sql.
  2. Word the question runtime of every question.

With Redshift Serverless 128 RPU and the TPC-DS 3TB dataset, we carried out pattern runs for 10 chosen TPC-DS queries the place NDV info was anticipated to be helpful. We ran every question 10 occasions. The outcomes proven within the following desk are sorted by the proportion of the efficiency enchancment for the queries with column statistics.

TPC-DS 3T Queries With out Column Statistics With Column Statistics Efficiency Enchancment (%)
Question 16 305.0284 51.7807 489.1
Question 75 398.0643 110.8366 259.1
Question 78 169.8358 52.8951 221.1
Question 95 35.2996 11.1047 217.9
Question 94 160.52 57.0321 181.5
Question 68 14.6517 7.4745 96
Question 4 217.8954 121.996 78.6
Question 72 123.8698 76.215 62.5
Question 29 22.0769 14.8697 48.5
Question 25 43.2164 32.8602 31.5

The outcomes demonstrated clear efficiency advantages starting from 31.5–489.1%.

To dive deep, let’s discover question 16, which confirmed the very best efficiency profit:

TPC-DS Question 16:

choose
   depend(distinct cs_order_number) as "order depend"
  ,sum(cs_ext_ship_cost) as "complete delivery price"
  ,sum(cs_net_profit) as "complete web revenue"
from
   "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs1
  ,"awsdatacatalog"."tpcdsdbwithstats"."date_dim"
  ,"awsdatacatalog"."tpcdsdbwithstats"."customer_address"
  ,"awsdatacatalog"."tpcdsdbwithstats"."call_center"
the place
    d_date between '2000-2-01' 
    and dateadd(day, 60, solid('2000-2-01' as date))
    and cs1.cs_ship_date_sk = d_date_sk
    and cs1.cs_ship_addr_sk = ca_address_sk
    and ca_state="AL"
    and cs1.cs_call_center_sk = cc_call_center_sk
    and cc_county in ('Dauphin County','Levy County','Luce County','Jackson County',
                    'Daviess County')
and exists (choose *
            from "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs2
            the place cs1.cs_order_number = cs2.cs_order_number
            and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
and never exists(choose *
               from "awsdatacatalog"."tpcdsdbwithstats"."catalog_returns" cr1
               the place cs1.cs_order_number = cr1.cr_order_number)
order by depend(distinct cs_order_number)
restrict 100;

You may examine the distinction between the question plans with and with out column statistics with the ANALYZE question.

The next screenshot reveals the outcomes with out column statistics.

The next screenshot reveals the outcomes with column statistics.

You may observe some notable variations because of utilizing column statistics. At a excessive stage, the general estimated price of the question is considerably diminished, from 20633217995813352.00 to 331727324110.36.

The 2 question plans selected totally different be a part of methods.

The next is one line included within the question plan with out column statistics:

XN Hash Be a part of DS_DIST_BOTH (cost45365031.50 rows=10764790749 width=44)
" Outer Dist Key: ""outer"".cs_order_number"
Interior Dist Key: volt_tt_61c54ae740984.cs_order_number
" Hash Cond: ((""outer"".cs_order_number = ""inside"".cs_order_number) AND (""outer"".cs_warehouse_sk = ""inside"".cs_warehouse_sk))"

The next is the corresponding line within the question plan with column statistics:

XN Hash Be a part of DS_BCAST_INNER (price=307193250965.64..327130154786.68 rows=17509398 width=32)
" Hash Cond: ((""outer"".cs_order_number = ""inside"".cs_order_number) AND (""outer"".cs_warehouse_sk = ""inside"".cs_warehouse_sk))"

The question plan for the desk with out column statistics used DS_DIST_BOTH when becoming a member of giant tables, whereas the question plan for the desk with column statistics selected DS_BCAST_INNER. The be a part of order has additionally modified primarily based on the column statistics. These be a part of technique and be a part of order modifications are primarily pushed by extra correct be a part of cardinality estimations, that are attainable with column statistics, and end in a extra optimized question plan.

Schedule AWS Glue column statistics Runs

Sustaining up-to-date column statistics is essential for optimum question efficiency. This part guides you thru automating the method of producing Iceberg desk column statistics utilizing Lambda and EventBridge Scheduler. This automation retains your column statistics updated with out guide intervention.

The required Lambda operate and EventBridge schedule are already created by the CloudFormation template. The Lambda operate is used to invoke the AWS Glue column statistics run. First, full the next steps to discover how the Lambda operate is configured:

  1. On the Lambda console, select Capabilities within the navigation pane.
  2. Open the operate GlueTableStatisticsFunctionv1.

For a clearer understanding of the Lambda operate, we suggest reviewing the code within the Code part and inspecting the setting variables beneath Configuration.

As proven within the following code snippet, the Lambda operate invokes the start_column_statistics_task_run API by the AWS SDK for Python (Boto3) library.

Subsequent, full the next steps to discover how the EventBridge schedule is configured:

  1. On the EventBridge console, select Schedules beneath Scheduler within the navigation pane.
  2. Find the schedule created by the CloudFormation console.

This web page is the place you handle and configure the schedules in your occasions. As proven within the following screenshot, the schedule is configured to invoke the Lambda operate every day at a particular time—on this case, 08:27 PM UTC. This makes certain the AWS Glue column statistics runs on an everyday and predictable foundation.

Clear up

When you’ve got completed all of the above steps, keep in mind to scrub up all of the AWS sources you created utilizing AWS CloudFormation:

  1. Delete the CloudFormation stack.
  2. Delete S3 bucket storing the Iceberg desk for the TPC-DS dataset and the AWS Glue job script.

Conclusion

This publish launched a brand new function within the Information Catalog that lets you create Iceberg desk column-level statistics. The Iceberg desk shops Theta Sketch, which can be utilized to estimate NDV effectively in a Puffin file. The Redshift Spectrum CBO can use that to optimize the question plan, leading to improved question efficiency and potential price financial savings.

Check out this new function within the Information Catalog to generate column-level statistics and enhance question efficiency, and tell us your suggestions within the feedback part. Go to the AWS Glue Catalog documentation to be taught extra.


Concerning the Authors

Sotaro Hikita is a Options Architect. He helps prospects in a variety of industries, particularly the monetary {industry}, to construct higher options. He’s significantly obsessed with huge information applied sciences and open supply software program.

Noritaka Sekiyama is a Principal Huge Information Architect on the AWS Glue workforce. He’s chargeable for constructing software program artifacts to assist prospects. In his spare time, he enjoys biking along with his new street bike.

Kyle Duong is a Senior Software program Improvement Engineer on the AWS Glue and AWS Lake Formation workforce. He’s obsessed with constructing huge information applied sciences and distributed techniques.

Kalaiselvi Kamaraj is a Senior Software program Improvement Engineer with Amazon. She has labored on a number of tasks inside the Amazon Redshift question processing workforce and at the moment specializing in performance-related tasks for Redshift information lakes.

Sandeep Adwankar is a Senior Product Supervisor at AWS. Primarily based within the California Bay Space, he works with prospects across the globe to translate enterprise and technical necessities into merchandise that allow prospects to enhance how they handle, safe, and entry information.