Integration with AWS Glue

Jeralyn • November 20, 2023

AWS Glue is a fully managed extract, transform, and load (ETL) service from AWS. One of its key abilities is to analyze and categorize data. You can use AWS Glue crawlers to automatically infer database and table schema from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.

Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in your Amazon Web Services account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.

To create database and table schema in the AWS Glue Data Catalog, you can run an AWS Glue crawler from within Athena on a data source, or you can run data definition language (DDL) queries directly in the Athena query editor. Then, using the database and table schema that you created, you can use data manipulation language (DML) queries in Athena to query the data.
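As a rough illustration of the DDL-then-DML flow described above, the following Python sketch builds a CREATE EXTERNAL TABLE statement and a SELECT statement, then wraps each in the parameters that the boto3 Athena client's start_query_execution call accepts. The database, table, column names, and S3 paths are hypothetical placeholders, not values from this article, and the CSV row format is only an assumption about the data.

```python
def build_ddl(database: str, table: str, location: str) -> str:
    """Return a DDL statement that registers CSV data in S3 as a table.

    The two columns are hypothetical; replace them with your own schema.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  request_id string,\n"
        "  status int\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


def build_dml(database: str, table: str, limit: int = 10) -> str:
    """Return a DML statement that reads from the table created above."""
    return f'SELECT request_id, status FROM "{database}"."{table}" LIMIT {limit}'


def query_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

In this sketch you would run the DDL statement once to create the schema and then submit DML statements against it; run_query only starts a query and returns its ID, so fetching results is left out.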

Using AWS Glue to connect to data sources in Amazon S3

Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.

To define schema information for AWS Glue to use, you can create an AWS Glue crawler to retrieve the information automatically, or you can manually add a table and enter the schema information.

Creating an AWS Glue crawler

You can start creating a crawler in the Athena console, which then takes you to the AWS Glue console to finish the setup. When you create the crawler, you specify a data location in Amazon S3 to crawl.

To create a crawler in AWS Glue starting from the Athena console

1. Open the Athena console at https://console.aws.amazon.com/athena/.

2. In the query editor, next to Tables and views, choose Create, and then choose AWS Glue crawler.

3. On the AWS Glue console Add crawler page, follow the steps to create a crawler. For more information, see Using AWS Glue Crawlers in this guide and Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.
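The console steps above can also be approximated programmatically. The following Python sketch assumes the boto3 AWS Glue client and its create_crawler and start_crawler calls; the crawler name, IAM role ARN, database name, and S3 path are hypothetical values that you would replace with your own. It is a minimal illustration, not the procedure that the console itself follows.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for the AWS Glue create_crawler call.

    s3_path is the Amazon S3 data location that the crawler scans.
    """
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must be an Amazon S3 URI such as s3://bucket/prefix/")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: dict) -> None:
    """Create the crawler and run it once (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so crawler_request works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

After the crawler finishes, the tables that it infers are written to the named database in the AWS Glue Data Catalog, where Athena can query them.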

Registering an AWS Glue Data Catalog from another account

You can use Athena's cross-account AWS Glue catalog feature to register an AWS Glue catalog from an account other than your own. After you configure the required IAM permissions for AWS Glue and register the catalog as an Athena Data Catalog resource, you can use Athena to run cross-account queries. For information about configuring the required permissions, see Cross-account access to AWS Glue data catalogs.

The following procedure shows you how to use the Athena console to configure an AWS Glue Data Catalog in an Amazon Web Services account other than your own as a data source.

To register an AWS Glue Data Catalog from another account

1. Follow the steps in Cross-account access to AWS Glue data catalogs to ensure that you have permissions to query the data catalog in the other account.

2. Open the Athena console at https://console.aws.amazon.com/athena/.

3. If the console navigation pane is not visible, choose the expansion menu on the left.

4. Choose Data sources.

5. On the upper right, choose Create data source.

6. On the Choose a data source page, for Data sources, choose S3 - AWS Glue Data Catalog, and then choose Next.

7. On the Enter data source details page, in the AWS Glue Data Catalog section, for Choose an AWS Glue Data Catalog, choose AWS Glue Data Catalog in another account.

8. For Data source details, enter the following information:

• Data source name – Enter the name that you want to use in your SQL queries to refer to the data catalog in the other account.

• Description – (Optional) Enter a description of the data catalog in the other account.

• Catalog ID – Enter the 12-digit Amazon Web Services account ID of the account to which the data catalog belongs. The Amazon Web Services account ID is the catalog ID.

9. (Optional) For Tags, enter key-value pairs that you want to associate with the data source. For more information about tags, see Tagging Athena resources.

10. Choose Next.

11. On the Review and create page, review the information that you provided, and then choose Create data source. The Data source details page lists the databases and tags for the data catalog that you registered.

12. Choose Data sources. The data catalog that you registered is listed in the Data source name column.

13. To view or edit information about the data catalog, choose the catalog, and then choose Actions, Edit.

14. To delete the new data catalog, choose the catalog, and then choose Actions, Delete.
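The registration procedure above can be sketched in Python as well. This example assumes the boto3 Athena client's create_data_catalog call with Type set to GLUE and a catalog-id parameter; the data source name, account ID, and tag values are hypothetical. The cross_account_query helper is an illustrative addition showing how the registered data source name would prefix a table reference in SQL.

```python
import re


def data_catalog_request(name, catalog_id, description="", tags=None):
    """Build keyword arguments for the Athena create_data_catalog call.

    catalog_id is the 12-digit account ID that owns the AWS Glue Data Catalog.
    """
    if not re.fullmatch(r"[0-9]{12}", catalog_id):
        raise ValueError("catalog_id must be a 12-digit account ID")
    request = {
        "Name": name,
        "Type": "GLUE",
        "Parameters": {"catalog-id": catalog_id},
    }
    if description:  # optional, like the Description field in the console
        request["Description"] = description
    if tags:  # optional key-value pairs, like the Tags step in the console
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


def cross_account_query(catalog_name, database, table, limit=10):
    """Return a query that refers to the other account's catalog by name."""
    return f'SELECT * FROM "{catalog_name}"."{database}"."{table}" LIMIT {limit}'


def register_catalog(request):
    """Register the data source (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without it

    boto3.client("athena").create_data_catalog(**request)
```

As in the console procedure, the IAM permissions for cross-account access must already be in place before queries against the registered catalog can succeed.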

Best practices when using Athena with AWS Glue

When using Athena with the AWS Glue Data Catalog, you can use AWS Glue to create databases and tables (schema) to be queried in Athena, or you can use Athena to create schema and then use it in AWS Glue and related services.

This topic provides considerations and best practices when using either method.

Under the hood, Athena uses Trino to process DML statements and Hive to process the DDL statements that create and modify schema. With these technologies, there are a couple of conventions to follow so that Athena and AWS Glue work well together.
