How to Set Up an AWS Glue ETL Job: Step-by-Step Guide for Beginners

What is AWS Glue?

AWS Glue is a serverless ETL service provided by Amazon Web Services. ETL stands for Extract, Transform, and Load. So if you have large Excel/CSV files and want to run some transformations on them and prepare reports, AWS Glue is what you are looking for.

Spark-Powered Speed: Behind the scenes, Glue uses Apache Spark (the engine loved by data engineers) to process large datasets quickly. Spark’s lazy evaluation optimizes workflows by deferring computation until results are actually needed, saving time and resources.

By the end of this read you will be able to set up and run your first Glue job.

Prerequisites

  • An AWS account with IAM permissions for Glue and S3.
  • A sample S3 bucket with a CSV/JSON file to use as a test dataset.

Step-by-Step Guide

Glue boilerplate code

When you create a new ETL job in the AWS console, you get some boilerplate code. Let’s understand it.

Importing Required Libraries:

  • awsglue.transforms: Provides transformation functions for AWS Glue.
  • awsglue.utils.getResolvedOptions: Retrieves arguments passed to the Glue job.
  • pyspark.context.SparkContext: The Spark context required for initializing PySpark.
  • awsglue.context.GlueContext: Provides Glue-specific operations.
  • awsglue.job.Job: Used to manage the Glue job.
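
Putting it all together, the generated script looks roughly like this (a sketch of the console boilerplate; your generated version may differ slightly):

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Fetch job arguments; JOB_NAME is always passed in by Glue
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Create a SparkContext and a GlueContext on top of it
sc = SparkContext()
glueContext = GlueContext(sc)

# Get the Spark session for DataFrame operations
spark = glueContext.spark_session

# Initialize the job
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... your extract / transform / load logic goes here ...

# Mark the end of the job and commit any pending operations
job.commit()
```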

Fetch job arguments. You can also define additional arguments and provide them when starting the job.

Create a SparkContext and a GlueContext; the GlueContext acts as a bridge between Apache Spark and AWS Glue.

glueContext.spark_session: initializes a Spark session for DataFrame operations.

job.init(): initializes the job.

job.commit(): marks the end of the job and commits any pending operations.


Extract data in a Glue job

We have multiple data extraction options in Glue, such as S3, databases, and the Glue Data Catalog. In this guide we will focus on processing S3 data with Glue.

In an AWS Glue job we have two options to load S3 data: a Glue DynamicFrame or a Spark DataFrame.

A Spark DataFrame is a distributed collection of structured data, optimized for fast processing, SQL queries, aggregations, and machine learning.

A Glue DynamicFrame is built on top of a Spark DataFrame. DynamicFrames are useful for messy or semi-structured data, for example when some fields can be blank/null, and they work well with AWS Glue transformations like ApplyMapping.
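
For reference, reading S3 data as a DynamicFrame looks roughly like this (a sketch; the bucket path is a placeholder and glueContext comes from the boilerplate above):

```python
# Read CSV data from S3 as a Glue DynamicFrame
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/input/"]},  # placeholder path
    format="csv",
    format_options={"withHeader": True},
)

# A DynamicFrame can be converted to a Spark DataFrame at any point
df_from_dyf = dyf.toDF()
```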

We will talk about which one is better in upcoming blogs, but for now let’s stick to Spark DataFrames.

Read a CSV file in a Glue job using Spark

If your CSV file has its first row as headers, you can read it with the header option enabled.
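
A minimal sketch (the S3 path is a placeholder; spark is the session from the boilerplate):

```python
# Read a CSV file from S3 into a Spark DataFrame
df = (spark.read
      .option("header", "true")       # treat the first row as column names
      .option("inferSchema", "true")  # let Spark detect column types
      .csv("s3://your-bucket/input/data.csv"))  # placeholder path
```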

When inferSchema is enabled, Spark automatically detects the data types of the columns instead of treating every column as a string. Spark makes two passes over the data:

  • First pass: scans the file to determine the schema.
  • Second pass: reads the full dataset with the inferred schema.

So for large files we should avoid setting inferSchema to true and instead define the schema explicitly for better performance.
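
For example, a sketch with an explicit schema (the column names here are hypothetical):

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Define the schema up front so Spark can skip the inference pass
schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("s3://your-bucket/input/data.csv"))  # placeholder path
```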

OK, so now we have our CSV data loaded from S3 into a Spark DataFrame. What next?

Now we can perform transformations, aggregations, validations, and write operations on these DataFrames.

Transformations and aggregations on Spark DataFrames

Let’s say you have a numeric field (amount) in your CSV file that you want to sum or average. The first thing you need to do is cast that column to double, and then filter out rows where amount is null, since null values can’t be summed or averaged.

Step 1: Cast the ‘amount’ column to double

```python
from pyspark.sql.functions import col

# A failed cast (e.g. a non-numeric string) produces null
df_casted = df.withColumn("amount", col("amount").cast("double"))
```

Step 2: Filter out rows where the casted ‘amount’ is NULL (indicating cast failure)
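
A minimal sketch, reusing df_casted from Step 1 (df_valid is an assumed name):

```python
# Rows where the cast failed (or the value was missing) are now null
df_valid = df_casted.filter(col("amount").isNotNull())
```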

If you want to count the total rows, sum the amounts, or calculate the average:
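
A sketch using df_valid from Step 2 (the aliases are hypothetical):

```python
from pyspark.sql.functions import avg, count, sum as sum_

# Compute count, sum, and average in a single aggregation
stats = df_valid.agg(
    count("*").alias("total_rows"),
    sum_("amount").alias("total_amount"),
    avg("amount").alias("average_amount"),
)
stats.show()
```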

Now let’s see how we can write our result DataFrames back to S3.


Upload data from a Glue job to S3

By default, Spark saves the output in multiple part files, one per partition. If you want a single output file, you can use repartition(1).
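
A sketch (the output path is a placeholder):

```python
# Write the result back to S3 as a single CSV file with a header row
(df_valid
 .repartition(1)  # collapse to one partition -> one output file
 .write
 .mode("overwrite")
 .option("header", "true")
 .csv("s3://your-bucket/output/report/"))  # placeholder path
```

Keep in mind that repartition(1) funnels all the data through a single partition, so only use it for small result sets.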

You are ready to go!

If you want more blogs on AWS Glue and Spark DataFrame operations, please consider leaving a comment.
