AWS Glue: The Complete ETL and Data Integration Guide


AWS Glue has gotten complicated with all the job types, pricing models, and competing ETL approaches flying around. I have built data pipelines on Glue for multiple production workloads, processing everything from small daily CSV imports to multi-terabyte data lake transformations, and along the way I have learned what works, what's overrated, and where Glue genuinely shines. Today, I will share it all with you.


Core Components of AWS Glue

1. Glue Data Catalog

Probably should have led with this section, honestly. The Data Catalog is Glue’s central metadata repository, and it’s arguably more important than the ETL engine itself. Think of it as a supercharged Hive metastore — it stores table definitions, schemas, partition information, and connection details for all your data sources. Every other analytics service on AWS (Athena, Redshift Spectrum, EMR) can read from the Data Catalog, which makes it the backbone of your data lake architecture.


| Component   | Description                                   | Use Case                                          |
|-------------|-----------------------------------------------|---------------------------------------------------|
| Databases   | Logical containers for tables                 | Organize tables by domain (sales_db, marketing_db) |
| Tables      | Schema definitions pointing to data locations | Define structure for S3 data, JDBC sources        |
| Crawlers    | Automated schema discovery                    | Scan S3/databases and infer schemas               |
| Connections | Network and credential configs                | Connect to RDS, Redshift, on-prem databases       |

The Data Catalog’s integration with other services is what makes it genuinely valuable beyond just Glue. When you define a table in the Catalog, Athena can immediately query that data. Redshift Spectrum can join it with your warehouse data. EMR notebooks can reference it. It becomes a single source of truth for metadata across your analytics stack, which is a massive improvement over the old approach of maintaining separate schema definitions in every tool.
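
As a quick illustration of that cross-service reach, here is a minimal sketch using boto3 (the Athena result location is a placeholder bucket; the database and table names reuse the examples from this guide) that reads a table definition straight out of the Data Catalog and then runs an Athena query against the same table, with no separate schema registration step:

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Inspect the table definition the Data Catalog already holds
table = glue.get_table(DatabaseName="sales_db", Name="raw_orders")
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])

# Athena queries the same table using the same metadata
# (the OutputLocation bucket below is a placeholder)
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM raw_orders",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}
)

Neither call carries its own copy of the schema; both services resolve sales_db.raw_orders through the Catalog.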

2. Glue ETL Jobs

The serverless execution model is what makes Glue ETL endearing to us data engineers: you define your transformation logic and Glue handles all the infrastructure. No cluster management, no capacity planning, no idle resources burning money between job runs.


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import year, month  # used for the date-part columns below

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Data Catalog
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="source_dyf"
)

# Apply transformations
# 1. Drop null values
cleaned_dyf = DropNullFields.apply(
    frame=source_dyf,
    transformation_ctx="cleaned_dyf"
)

# 2. Map columns to new schema
mapped_dyf = ApplyMapping.apply(
    frame=cleaned_dyf,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("customer_id", "string", "customer_id", "string"),
        ("amount", "double", "order_amount", "decimal"),
        ("order_date", "string", "order_date", "date")
    ],
    transformation_ctx="mapped_dyf"
)

# 3. Convert to Spark DataFrame for complex transformations
df = mapped_dyf.toDF()
df = df.filter(df.order_amount > 0)
df = df.withColumn("year", year(df.order_date))
df = df.withColumn("month", month(df.order_date))

# Convert back to DynamicFrame
output_dyf = DynamicFrame.fromDF(df, glueContext, "output_dyf")

# Write to S3 as partitioned Parquet
glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/processed/orders/",
        "partitionKeys": ["year", "month"]
    },
    format="parquet",
    transformation_ctx="output"
)

job.commit()

A few things to note about this code. The transformation_ctx parameter is crucial — it enables job bookmarks (incremental processing), which we’ll cover next. The DynamicFrame is Glue’s extension of Spark DataFrames that handles schema inconsistencies more gracefully. And partitioning the output by year and month dramatically improves query performance when downstream consumers (Athena, Redshift Spectrum) filter by date.
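
To make the schema-flexibility point concrete, here is a minimal sketch of resolveChoice, assuming a hypothetical case where the amount field arrives as a string in some files and as a double in others. A plain Spark DataFrame read forces a single type up front; a DynamicFrame records the conflict as a choice type and lets you resolve it explicitly:

# Hypothetical scenario: "amount" arrives as string in some files, double in others
ambiguous_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="ambiguous_dyf"
)

# Resolve the choice type by casting every value to double.
# Alternatives: "make_cols" splits it into amount_string/amount_double columns,
# "make_struct" keeps both variants inside a struct.
resolved_dyf = ambiguous_dyf.resolveChoice(specs=[("amount", "cast:double")])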

Glue Crawlers: Automated Schema Discovery

Crawlers scan your data sources and automatically infer table schemas. They’re incredibly useful when you’re ingesting data from sources where the schema might change over time or when you’re cataloging existing data in S3 that doesn’t have explicit schema definitions.
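
Crawlers can be created in the console, but defining them in code keeps your catalog reproducible. Below is a minimal sketch with boto3, assuming placeholder values for the IAM role ARN, the S3 path, and the schedule; it registers a crawler over an S3 prefix and triggers an immediate run:

import boto3

glue = boto3.client("glue")

# Role ARN and S3 path are placeholders -- substitute your own
glue.create_crawler(
    Name="raw-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
    # Run nightly at 2 AM UTC; omit Schedule to run on demand only
    Schedule="cron(0 2 * * ? *)",
    # Update changed schemas in place, log (rather than drop) removed objects
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
    TablePrefix="raw_",
)

# Kick off a run immediately instead of waiting for the schedule
glue.start_crawler(Name="raw-orders-crawler")

The SchemaChangePolicy shown keeps the Catalog in sync as schemas evolve without silently dropping tables when underlying files disappear.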