Aug 05, 2020
Install & run AWS Glue 1.0 and PySpark on Ubuntu 20.04
Background
It’s much faster to develop and debug AWS Glue / PySpark scripts locally.
The Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library instructions describe the installation but are not complete; a few additional dependencies need to be taken care of to make this work on Ubuntu 20.04.
Also note that the PySpark version (2.4.3) used by Glue 1.0 does not support:
- Python 3.8, which Ubuntu 20.04 ships with. Some of the PySpark code therefore needs to be patched, as described on Stack Overflow and in a Gist.
- Java 11, which Ubuntu 20.04 also ships with. OpenJDK 8 (headless) is therefore installed and made the default Java runtime.
Install Ubuntu package dependencies
First install the Ubuntu package dependencies:
sudo apt install zip
sudo apt install python3-pytest  # the Python 2 package python-pytest is no longer available on Ubuntu 20.04
# Maven v3.6.3 is currently distributed
sudo apt install maven
# Ubuntu 20.04 comes with OpenJDK 11 by default, which PySpark 2.4.3 is not compatible with
sudo apt install openjdk-8-jdk-headless
sudo update-alternatives --config java
# Choose the option /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
Consider updating the python and pip alternatives:
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
sudo update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
Install AWS Glue Python library
cd
mkdir -p $HOME/app
# Just get the zip file from GitHub, no need to clone the repo (get glue-1.0, which supports Python 3)
curl -LO https://github.com/awslabs/aws-glue-libs/archive/glue-1.0.zip
unzip glue-1.0.zip -d $HOME/app
mv $HOME/app/aws-glue-libs-glue-1.0 $HOME/app/aws-glue-libs
rm glue-1.0.zip
Install Glue (1.0) artifacts
cd
mkdir -p $HOME/app
curl -LO https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
tar xvpfz spark-2.4.3-bin-hadoop2.8.tgz -C $HOME/app
rm spark-2.4.3-bin-hadoop2.8.tgz
Now, as per the Gist mentioned above, the file $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py needs to be edited for PySpark to work with Python 3.8. If this is not done, the PySpark shell will fail to start.
First make a copy of the file:
cp -p $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py \
$HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py.original
Then edit the file $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py and make the necessary changes as per the above-mentioned Gist (a sketch of the change follows below).
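For reference, the failure on Python 3.8 comes from the types.CodeType constructor gaining an extra posonlyargcount parameter in 3.8, which the cloudpickle.py bundled with Spark 2.4.3 does not pass. The Gist remains the authoritative reference for the exact diff; the snippet below is only a sketch of that kind of edit, assuming (as in the Gist) that the call to fix lives in _make_cell_set_template_code():
    # In _make_cell_set_template_code(), the Python 3 branch builds a CodeType by hand.
    # On Python 3.8+ the constructor expects co_posonlyargcount as its second argument,
    # so the call is split on the interpreter version (sys and types are already
    # imported at the top of cloudpickle.py).
    if sys.version_info < (3, 8):
        return types.CodeType(
            co.co_argcount,
            co.co_kwonlyargcount,
            co.co_nlocals,
            co.co_stacksize,
            co.co_flags,
            co.co_code,
            co.co_consts,
            co.co_names,
            co.co_varnames,
            co.co_filename,
            co.co_name,
            co.co_firstlineno,
            co.co_lnotab,
            co.co_cellvars,  # co_freevars is initialised with co_cellvars here
            (),              # and co_cellvars is left empty (as in the original code)
        )
    else:
        return types.CodeType(
            co.co_argcount,
            co.co_posonlyargcount,  # new in Python 3.8
            co.co_kwonlyargcount,
            co.co_nlocals,
            co.co_stacksize,
            co.co_flags,
            co.co_code,
            co.co_consts,
            co.co_names,
            co.co_varnames,
            co.co_filename,
            co.co_name,
            co.co_firstlineno,
            co.co_lnotab,
            co.co_cellvars,
            (),
        )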
Create glue source file
mkdir -p $HOME/bin
cat <<EOF >$HOME/bin/glue
# Spark / Glue
export SPARK_HOME=\$HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
export PATH=\$HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/bin:\$PATH
export PATH=\$HOME/app/aws-glue-libs/bin:\$PATH
EOF
Source environment and test
The ./bin/gluepyspark command below will download a considerable number of artifacts using Maven.
. $HOME/bin/glue
cd $HOME/app/aws-glue-libs
# Start Glue Shell
./bin/gluepyspark
PySpark for IDE lookup
Optionally, PySpark (the same version as used with Glue) can be installed in a virtualenv so that the IDE can resolve and cross-reference it.
# For VSCode run this from the project/git folder
python3 -m venv .
# Activate the virtualenv so that pip installs into it
. bin/activate
pip3 install pyspark==2.4.3
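As a quick optional sanity check, assuming the virtualenv is activated, the snippet below run with the virtualenv's interpreter should report the same PySpark version the IDE will see:
# Verify that the virtualenv's interpreter resolves the locally installed PySpark
import pyspark
print(pyspark.__version__)  # expected: 2.4.3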
Test AWS Glue set-up & PySpark
Before starting the Glue PySpark shell:
- Make sure relevant AWS credentials are available via environment variables or an ~/.aws/credentials profile (and AWS_PROFILE is set accordingly)
- Set a suitable region
Start the Glue PySpark shell:
export AWS_REGION=eu-west-1
cd $HOME/app/aws-glue-libs
./bin/gluepyspark
Run some test code; if this doesn’t raise an error, the setup is ready to go:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init('test-job1')
job.commit()
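As an optional further check that exercises the Glue APIs end to end, something like the following can be run in the same shell session. The S3 path here is purely a placeholder (an assumption, not part of the original setup); point it at a small dataset readable from your account:
# Hypothetical example: read a small JSON dataset from S3 into a DynamicFrame.
# The bucket/prefix below is a placeholder - replace it with a real, readable location.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/some/prefix/"]},
    format="json",
)
print(dyf.count())
dyf.printSchema()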