How to Get Started with AWS Glue ETL
In the rapidly evolving landscape of data management and analytics, AWS Glue has emerged as a critical tool for businesses navigating the complexities of data processing. As part of Amazon Web Services, AWS Glue offers a comprehensive, managed service that streamlines the tasks of data extraction, transformation, and loading (ETL).
This in-depth guide is designed to provide a thorough understanding of AWS Glue and the CData AWS Glue Connectors, showcasing their capabilities and guiding you to utilize their full potential for efficient data processing.
What Is AWS Glue?
AWS Glue is the fully managed, serverless ETL service in the Amazon Web Services suite. It extracts data from various sources, transforms it into a structured, analyzable format, and loads it into a data warehouse or data lake for analytics and business intelligence.
The pivotal role in modern data management
AWS Glue plays a vital role in managing large-scale data operations. Its capacity to automate the ETL process allows businesses to concentrate on extracting meaningful insights from their data. This automation is crucial in today's data-driven world, where the efficient handling of vast data volumes can significantly impact business strategies and outcomes.
Key features of AWS Glue
AWS Glue is distinguished by several innovative features that make it a leader in data processing:
One of AWS Glue's most notable features is its scalability. It dynamically adjusts resource allocation based on the demands of your data processing tasks. This adaptability ensures that businesses can handle varying data loads efficiently and cost-effectively.
The serverless architecture of AWS Glue means that businesses no longer need to worry about the maintenance and management of physical servers. This focus on data rather than infrastructure is a game-changer, significantly reducing overhead costs and complexity.
In-depth component exploration
- Data catalog: A central repository for all metadata, which simplifies data discovery and governance.
- Job scheduling: Automates the timing and execution of ETL jobs so that data is processed efficiently and on time.
- Crawlers: Automatically detect and classify data, keeping the Data Catalog up to date.
- Data store: The repository (for example, an Amazon S3 bucket or a relational database) that persistently holds your transformed data, ready for analysis and business intelligence applications.
3 Benefits of using AWS Glue
1. Cost efficiency
AWS Glue's pay-as-you-go pricing model ensures businesses only pay for the resources they use, leading to significant cost savings, especially for companies dealing with fluctuating data processing needs.
2. Simplification of ETL processes
AWS Glue can automatically generate ETL code, which marks a significant step forward in simplifying data transformation processes. This automation reduces the need for extensive manual coding, speeding up the ETL pipeline and reducing the likelihood of errors.
3. Adaptability to schema changes
The automatic schema detection and adaptation feature of AWS Glue ensures high data quality and consistency. This adaptability is essential for maintaining data accuracy, a critical factor in data analytics and business decision-making.
Use cases for AWS Glue
- Enhanced analytics on Amazon S3: AWS Glue simplifies the process of performing analytics on data stored in Amazon S3. By managing the complexities of server infrastructure, it enables businesses to focus on extracting valuable insights from their data.
- Integration of diverse AWS data sets: The ability of AWS Glue to seamlessly integrate data from various AWS sources is invaluable for businesses that utilize multiple AWS services. This integration facilitates a more comprehensive and unified approach to data analytics.
- Creation of real-time ETL workflows: AWS Glue excels in building ETL workflows that respond to events in real time. This feature is particularly beneficial for businesses that require immediate data processing to inform quick decision-making.
CData AWS Glue Connectors
CData's connectors for AWS Glue enhance the functionality of AWS Glue by providing:
- Seamless integration with various data sources, enhancing the data processing capabilities of AWS Glue
- Advanced functionality through AWS Glue Studio, allowing for more sophisticated ETL processes
- Access to real-time data, which is essential for making informed decisions in a fast-paced business environment
Getting started with AWS Glue ETL and CData AWS Glue Connectors
Prerequisites and requirements
There are no external operating system, database type, or storage requirements for using the CData AWS Glue Connector for Connect Cloud. To get the most out of the connector, you will need familiarity with AWS Glue, AWS Glue Studio, and Python/Apache Spark. You will need an AWS account and a subscription to the AWS Glue Connector.
AWS Glue and Glue Studio jobs run on Amazon EC2 instances. The CData AWS Glue Connector is a container image that runs on Amazon ECS, and the sample Glue job in this walkthrough stores data in Amazon S3.
You will also need a CData Connect Cloud account, and a connection configured in CData Connect Cloud.
Update permissions for your IAM role
When you create the AWS Glue job, you specify an AWS Identity and Access Management (IAM) role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 for any sources, targets, scripts, temporary directories, and AWS Glue Data Catalog objects. The role must also grant access to the CData Glue Connector for Connect Cloud from the AWS Marketplace.
NOTE: Do not use the root user for any deployments or operations.
The following policies should be added to the IAM role for the AWS Glue job, at a minimum:
- AWSGlueServiceRole (for accessing Glue Studio and Glue Jobs)
- AmazonEC2ContainerRegistryReadOnly (for accessing the CData AWS Glue Connector for Connect Cloud)
If you will be accessing data found in Amazon S3, add:
- AmazonS3FullAccess (for reading from and writing to Amazon S3)
And lastly, if you will be using AWS Secrets Manager to store confidential connection properties (see more below), you will need to add an inline policy like the following, granting access to the specific secrets needed for the Glue Job:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": [
        "arn:aws:secretsmanager:us-west-2:111122223333:secret:aes128-1a2b3c",
        "arn:aws:secretsmanager:us-west-2:111122223333:secret:aes192-4D5e6F",
        "arn:aws:secretsmanager:us-west-2:111122223333:secret:aes256-7g8H9i"
      ]
    }
  ]
}
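If you manage several Glue jobs, it can help to generate this inline policy rather than edit it by hand. The following is a minimal sketch, not an official tool: the function name `build_secrets_policy` is hypothetical, and the ARN is the placeholder from the example above. It only builds the JSON document; attaching it to the role is still done through IAM.

```python
import json

# The four read-only Secrets Manager actions from the policy above.
SECRET_ACTIONS = [
    "secretsmanager:GetResourcePolicy",
    "secretsmanager:GetSecretValue",
    "secretsmanager:DescribeSecret",
    "secretsmanager:ListSecretVersionIds",
]


def build_secrets_policy(secret_arns):
    """Return a policy document that allows reading only the given secrets.

    Hypothetical helper for illustration; it does not call AWS.
    """
    if not secret_arns:
        raise ValueError("at least one secret ARN is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(SECRET_ACTIONS),
                "Resource": list(secret_arns),
            }
        ],
    }


# Placeholder ARN copied from the example policy above.
policy = build_secrets_policy(
    ["arn:aws:secretsmanager:us-west-2:111122223333:secret:aes128-1a2b3c"]
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Listing specific ARNs, rather than a wildcard, keeps the job's access limited to the secrets it actually needs.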
For more information about granting access to AWS Glue Studio and Glue Jobs, see Setting up IAM Permissions for AWS Glue in the AWS Glue documentation.
For more information about granting access to the Amazon S3 buckets, see Identity and Access Management in the Amazon Simple Storage Service Developer Guide.
For more information on setting up access control for your secrets, see Authentication and Access Control for AWS Secrets Manager in the AWS Secrets Manager documentation and Limiting Access to Specific Secrets in the AWS Secrets Manager User Guide. The credential retrieved from AWS Secrets Manager (a string of key-value pairs) is substituted into the JDBC URL that the CData Glue Connector uses when connecting to the data source. The JDBC URL format appears in the activation steps later in this guide.
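To make the relationship between the secret and the JDBC URL concrete, here is a small sketch of the substitution idea. AWS Glue performs the real substitution itself; this code only imitates it locally with Python's `string.Template`, which happens to use the same `${Name}` placeholder syntax. It assumes the secret is stored as a JSON object of key-value pairs, and the user name and token values are made up.

```python
import json
from string import Template

# JDBC URL format for Connect Cloud, with placeholders for the secret keys.
JDBC_URL_TEMPLATE = (
    "jdbc:cdata:connect:AuthScheme=BASIC;User=${Username};Password=${Password}"
)


def render_jdbc_url(template, secret_string):
    """Fill ${Key} placeholders from a secret stored as a JSON object.

    Illustration only: a missing key raises KeyError, which mirrors the
    fact that every placeholder needs a matching key in the secret.
    """
    secret = json.loads(secret_string)
    return Template(template).substitute(secret)


# Made-up credentials, shaped like a key-value secret.
example_secret = json.dumps(
    {"Username": "user@example.com", "Password": "example-pat"}
)
jdbc_url = render_jdbc_url(JDBC_URL_TEMPLATE, example_secret)
```

The key names in the secret must match the placeholder names in the URL exactly, including case.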
For more general information on IAM and IAM best practices, refer to the AWS IAM page.
Collect Connect Cloud connection properties
The best way to authenticate with Connect Cloud in AWS Glue is with a username and password.
Your Username is your Connect Cloud user name (likely an email address). Your Password is a personal access token (PAT) created in the Connect Cloud interface.
Subscribe to the CData Glue Connector for Connect Cloud
To work with the CData Glue Connector for Connect Cloud in AWS Glue Studio, you need to subscribe to the Connector from the AWS Marketplace. If you have already subscribed to the CData Glue Connector for Connect Cloud, you can skip to the next section. Note that there is no monthly subscription fee for the CData Glue Connector for Connect Cloud.
Activate the CData Glue Connector for Connect Cloud in Glue Studio
To use the CData Glue Connector for Connect Cloud in AWS Glue, you need to activate the subscribed connector in AWS Glue Studio. The activation process creates a connector object and connection in your AWS account.
- Once you subscribe to the connector, a new Config tab shows up in the AWS Marketplace connector page.
- Choose the delivery options and click the "Continue to Launch" button.
- On the launch tab, click "Usage Instructions" and follow the link that appears to create and configure the connection.
- Under Connection access, select the JDBC URL format and configure the connection. Below you will find sample connection string(s) for the JDBC URL format(s) available for Connect Cloud. You can read more about authenticating with Connect Cloud in the Help documentation for the Connector.
  - Username & Password: jdbc:cdata:connect:AuthScheme=BASIC;User=${Username};Password=${Password}
- (Optional) Enable logging for the Connector. If you want to log the functionality from the CData Glue Connector for Connect Cloud, you will need to append two properties to the JDBC URL:
  - Logfile: Set this to "STDOUT://"
  - Verbosity: Set this to an integer (1-5) for varying depths of logging. 1 is the default, 3 is recommended for most debugging scenarios.
- Configure the Network options and click "Create Connection."
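The optional logging step amounts to appending two properties to the JDBC URL. The sketch below shows one way to do that. The helper name `with_logging` is hypothetical, and the code assumes that connection properties are separated by semicolons, as they are in the sample connection string.

```python
def with_logging(jdbc_url, verbosity=3):
    """Append the Logfile and Verbosity properties to a JDBC URL.

    Hypothetical helper: assumes properties are separated by ';'.
    """
    if not 1 <= verbosity <= 5:
        raise ValueError("verbosity must be an integer from 1 to 5")
    base = jdbc_url.rstrip(";")  # avoid a doubled separator
    return f"{base};Logfile=STDOUT://;Verbosity={verbosity};"


logged_url = with_logging(
    "jdbc:cdata:connect:AuthScheme=BASIC;User=${Username};Password=${Password}"
)
print(logged_url)
```

The default of 3 follows the recommendation above for most debugging scenarios. Lower it again once the job is stable to keep the logs small.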
Configure the AWS Glue job
Once you have configured a connection, you can build a Glue Job.
Create a job that uses the connection
- In Glue Studio, under "Your connections," select the connection you created
- Click "Create job"
The visual job editor appears. A new Source node, derived from the connection, is displayed on the Job graph. In the node details panel on the right, the Source Properties tab is selected for user input.
Configure the Source Node properties
You can configure the access options for your connection to the data source in the Source properties tab. Refer to the AWS Glue Studio documentation for more information. Here's a simple walk-through:
- In the visual job editor, make sure the Source node for your connector is selected. Choose the Source properties tab in the node details panel on the right, if it is not already selected.
- The Connection field is populated automatically with the name of the connection associated with the marketplace connector.
- Enter information about the data location in the data source. Provide either a source table name or a query to use to retrieve data from the data source. An example of a query is SELECT Industry, AnnualRevenue FROM Account WHERE Name = 'GenePoint'.
- To pass information from the data source to the transformation nodes, AWS Glue Studio must know the schema of the data. Select "Use Schema Builder" to specify the schema interactively.
- Configure the remaining optional fields as needed. You can configure the following:
  - Partitioning information for parallelizing the read operations from the data source
  - Data type mappings to convert data types used in the source data to the data types supported by AWS Glue
  - Filter predicate to select a subset of the data from the data source
  See "Use the Connection in a Glue job using Glue Studio" for more information about these options.
- You can view the schema generated by this node by choosing the Output schema tab in the node properties panel.
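When you view the script that Glue Studio generates, the Source node settings show up as a dictionary of connection options. The sketch below builds such a dictionary in plain Python. The helper name and the connection name are hypothetical. The option keys (`connectionName`, `dbTable`, `query`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `dataTypeMapping`, `filterPredicate`) reflect the AWS Glue documentation for Marketplace JDBC connectors as best understood here, so verify them against the current documentation before relying on them.

```python
# Optional keys, as understood from the AWS Glue docs for Marketplace JDBC
# connectors (an assumption to verify, not an exhaustive list).
OPTIONAL_KEYS = {
    "partitionColumn",
    "lowerBound",
    "upperBound",
    "numPartitions",
    "dataTypeMapping",
    "filterPredicate",
}


def build_source_options(connection_name, table=None, query=None, **optional):
    """Build connection options for a Source node (hypothetical helper).

    Exactly one of `table` or `query` must be given, matching the
    "source table name or a query" choice in the Source properties tab.
    """
    if (table is None) == (query is None):
        raise ValueError("provide exactly one of table or query")
    unknown = set(optional) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    options = {"connectionName": connection_name}
    if table is not None:
        options["dbTable"] = table
    else:
        options["query"] = query
    options.update(optional)
    return options


source_options = build_source_options(
    "cdata-connect-cloud-connection",  # hypothetical connection name
    query="SELECT Industry, AnnualRevenue FROM Account WHERE Name = 'GenePoint'",
)

# Inside a Glue job (not runnable locally), these options would be passed as:
# glueContext.create_dynamic_frame.from_options(
#     connection_type="marketplace.jdbc",
#     connection_options=source_options,
# )
```

Rejecting unknown keys up front catches typos locally, before the job starts and fails at read time.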
Edit, save, and run the job
Edit the job by adding and editing the nodes in the job graph. See Editing ETL jobs in AWS Glue Studio for more information.
After you complete editing the job, enter the job properties.
- Select the Job properties tab above the visual graph editor.
- Configure the following job properties when using custom connectors:
- Name: Provide a job name.
- IAM Role: Choose (or create) an IAM role with the necessary permissions, as described previously.
- Type: Choose "Spark."
- Glue version: Choose "Glue 2.0 - Supports spark 2.4, Scala 2, Python 3."
- Language: Choose "Python 3."
- Use the default values for the other parameters. For more information about job parameters, see "Defining Job Properties" in the AWS Glue Developer Guide.
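If you prefer to script job creation instead of using the Job properties tab, the same settings map onto the parameters of the AWS Glue `CreateJob` API. The sketch below only assembles those parameters as a dictionary. The helper name, the job name, the role ARN, the script path, and the connection name are all hypothetical placeholders, and nothing here calls AWS.

```python
def build_job_definition(name, role_arn, script_location, connection_name):
    """Assemble CreateJob parameters that mirror the job properties above.

    Hypothetical helper; all argument values are supplied by the caller.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",
        "Connections": {"Connections": [connection_name]},
    }


job_definition = build_job_definition(
    name="connect-cloud-to-s3",  # hypothetical job name
    role_arn="arn:aws:iam::111122223333:role/example-glue-role",  # placeholder
    script_location="s3://example-bucket/scripts/job.py",  # placeholder
    connection_name="cdata-connect-cloud-connection",  # hypothetical
)

# With boto3 installed and credentials configured, this could be submitted as:
# boto3.client("glue").create_job(**job_definition)
```

Keeping the definition as data makes it easy to review or store in version control before anything is created in your account.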
Using the CData Glue Connector for Connect Cloud in AWS Glue Studio, you can easily create ETL jobs to load cloud data into an S3 bucket or any other destination. You can also use the Glue Connector to add, update, or delete Connect Cloud data in your Glue Jobs.
Health check
The CData Glue Connector for Connect Cloud is used as part of AWS Glue jobs. As such, you can use CloudWatch and the built-in logging (see the optional logging instructions above) to monitor the health of the Glue job and the functionality of the Connector.
Backup and recovery
The CData Glue Connector for Connect Cloud is a deployed container. Backup and recovery consist of simply resubscribing to the Connector in the event of a failure or corrupted deployment.
Routine maintenance
There are several pieces of routine maintenance involved with the CData Glue Connectors:
- Rotating credentials and keys: Follow the guidance of your IT administration for the rotation of any credentials and keys stored in AWS Secrets Manager.
- Software patches and upgrades: The CData Glue Connector will only be patched for breaking errors. Upgrades will be released quarterly. Monitor the "Latest Version" in the AWS Marketplace listing and simply re-subscribe if the Latest Version is greater than your currently subscribed version.
- Managing licenses: Licenses can be managed (subscriptions discontinued as needed) in AWS License Manager.
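The upgrade check in the list above is a simple version comparison. The sketch below shows one way to automate it once you have both version strings. It assumes versions are dot-separated integers, which may not match the actual format in the Marketplace listing, and the function names are hypothetical.

```python
def parse_version(version):
    """Turn a dot-separated version string into a comparable tuple.

    Assumes purely numeric parts; trailing zeros are dropped so that
    "1.2" and "1.2.0" compare as equal.
    """
    parts = [int(part) for part in version.strip().split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def needs_resubscribe(subscribed, latest):
    """Return True when the listed Latest Version is newer than yours."""
    return parse_version(latest) > parse_version(subscribed)


print(needs_resubscribe("22.0.8400", "23.0.8500"))
```

Comparing parsed tuples rather than raw strings avoids the mistake of ranking "9.1" above "10.0".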
CData & AWS Glue: Better together
AWS Glue is a cloud-based ETL tool that enables easier data management and analytics. Its combination of automation, scalability, and serverless architecture positions it as an ideal solution for businesses looking to streamline their data processing workflows.
In conjunction with CData Connect Cloud, AWS Glue offers an even more powerful platform, enabling businesses to fully harness the potential of their data for strategic decision-making and growth.
Get started with CData
Get a free 30-day trial of CData Connect Cloud and start leveraging the AWS Glue Connector today!