It’s much faster to develop and debug AWS Glue / PySpark scripts locally.
The AWS instructions, "Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library", describe the installation but are incomplete: a few additional dependencies have to be taken care of to make this work.
Also, note that the version of PySpark used (2.4.3) for Glue 1.0 does not support:
- Python 3.8, which Ubuntu 20.04 ships with. Some of the PySpark code therefore needs to be patched, as described on Stack Overflow and in a Gist.
- Java 11, which Ubuntu 20.04 ships with. OpenJDK 8 headless is therefore installed and made the default Java runtime.
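The two constraints above can be checked before installing anything. The following is a minimal sketch, not part of the original instructions: the function names are hypothetical, and it simply encodes the statements above (Python 3.8 or later needs the PySpark patch, and Java 8 is the runtime to use).

```python
import re
import sys


def python_needs_cloudpickle_patch(version_info=sys.version_info):
    """Return True for Python 3.8+, which PySpark 2.4.3 does not support unpatched."""
    return tuple(version_info[:2]) >= (3, 8)


def java_major_version(version_output):
    """Extract the major Java version from the text printed by `java -version`.

    Handles the legacy "1.8.0_292" scheme (major version 8) and the newer
    "11.0.11" scheme (major version 11). Returns None if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)  # legacy scheme: 1.8 means Java 8
    return int(first)


def java_is_supported(version_output):
    """Assumption taken from the text above: only Java 8 is treated as supported."""
    return java_major_version(version_output) == 8
```

Note that `java -version` writes to standard error, so when capturing it with `subprocess.run(["java", "-version"], capture_output=True, text=True)` the text to pass in is the `stderr` attribute of the result.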
Install Ubuntu package dependencies
First install the Ubuntu package dependencies:
sudo apt install zip
sudo apt install python-pytest
# Maven v3.6.3 is currently distributed
sudo apt install maven
# Ubuntu 20.04 comes with openjdk 11 per default, which PySpark is not compatible with
sudo apt install openjdk-8-jdk-headless
sudo update-alternatives --config java
# Choose the option /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
sudo update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
Install AWS Glue Python library
cd
mkdir -p $HOME/app
# Just get the zip file from github, no need to clone repo (get glue-1.0, which supports Python3)
curl -LO https://github.com/awslabs/aws-glue-libs/archive/glue-1.0.zip
unzip glue-1.0.zip -d $HOME/app
mv $HOME/app/aws-glue-libs-glue-1.0 $HOME/app/aws-glue-libs
rm glue-1.0.zip
Install Glue (1.0) artifacts
cd
mkdir -p $HOME/app
curl -LO https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
tar xvpfz spark-2.4.3-bin-hadoop2.8.tgz -C $HOME/app
rm spark-2.4.3-bin-hadoop2.8.tgz
Now, as per this Gist, the file $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py needs to be edited for PySpark to work with Python 3.8. If this is not done, the PySpark shell will fail to start.
First make a copy of the file:
cp -p $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py \
  $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py.original
Then edit the file $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py and make the necessary changes as per the above-mentioned Gist.
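Since the backup copy sits next to the edited file, it is easy to confirm afterwards that the edit was actually made. The helper below is a hypothetical sketch (the function name and the status strings are illustrative): it only reports whether cloudpickle.py differs from cloudpickle.py.original and says nothing about whether the edit itself is correct.

```python
from pathlib import Path


def cloudpickle_edit_status(pyspark_dir):
    """Compare cloudpickle.py with its .original backup in the given directory.

    Returns one of:
      "missing"   - cloudpickle.py does not exist
      "no-backup" - the .original copy has not been made
      "unchanged" - both files are byte-for-byte identical (not edited yet)
      "edited"    - the file differs from the backup
    """
    target = Path(pyspark_dir) / "cloudpickle.py"
    backup = Path(pyspark_dir) / "cloudpickle.py.original"
    if not target.is_file():
        return "missing"
    if not backup.is_file():
        return "no-backup"
    if target.read_bytes() == backup.read_bytes():
        return "unchanged"
    return "edited"
```

With the layout used here, the directory to pass in would be $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark.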
Create glue source file
mkdir -p $HOME/bin
cat <<EOF >$HOME/bin/glue
# Spark / Glue
export SPARK_HOME=\$HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
export PATH=\$HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/bin:\$PATH
export PATH=\$HOME/app/aws-glue-libs/bin:\$PATH
EOF
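The same settings can also be built in Python, for example to launch Glue tooling from a test runner or an IDE task without sourcing the file first. This is a sketch under the directory layout used in this post; the function name `glue_environment` is hypothetical.

```python
import os
from pathlib import Path

# Directory name created by unpacking the Glue 1.0 Spark archive (see above)
SPARK_DIR_NAME = "spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8"


def glue_environment(home, base_env=None):
    """Return a copy of the environment with SPARK_HOME and PATH set like the glue file."""
    env = dict(os.environ if base_env is None else base_env)
    app = Path(home) / "app"
    spark_home = app / SPARK_DIR_NAME
    env["SPARK_HOME"] = str(spark_home)
    # Same resulting order as the shell file: aws-glue-libs/bin first, then Spark's bin
    extra = [str(app / "aws-glue-libs" / "bin"), str(spark_home / "bin")]
    existing = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(extra + ([existing] if existing else []))
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run`.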
Source environment and test
The ./bin/gluepyspark command will download a considerable number of artifacts using Maven.
# Source the file by its full path, in case $HOME/bin is not on the PATH yet
. $HOME/bin/glue
cd $HOME/app/aws-glue-libs
# Start Glue Shell
./bin/gluepyspark
pyspark for IDE lookup
Optionally, it’s possible to install PySpark (same version as that used with Glue) in the virtualenv to ensure that the IDE can cross-reference it.
# For VSCode run this from the project/git folder
python3 -m venv .
# Activate the virtualenv so that pip3 installs into it
. bin/activate
pip3 install pyspark==2.4.3
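To confirm that the virtualenv really holds the same PySpark version as Glue 1.0, the installed version can be read from the package metadata. A small sketch, assuming Python 3.8 or later (where `importlib.metadata` is in the standard library); the function names are illustrative.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_glue_pyspark(expected="2.4.3"):
    """True if the active environment has the PySpark version used with Glue 1.0."""
    return installed_version("pyspark") == expected
```

Run it with the virtualenv's interpreter; otherwise it reports on whatever Python environment happens to be active.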
Test AWS Glue set-up & PySpark
Before starting the Glue PySpark shell:
- Make sure relevant AWS credentials are available via environment variables or a .aws/credentials profile (and AWS_PROFILE is set accordingly)
- Set a suitable region
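The two checks above can be sketched as a small preflight function. This is illustrative only: the function name is hypothetical, it looks for the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables or an existing credentials file, and it does not verify that the credentials are valid or that AWS_PROFILE names a profile that exists.

```python
import os
from pathlib import Path


def glue_shell_preflight(env=None, credentials_file=None):
    """Return a list of problems found; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"

    problems = []
    has_keys = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))
    has_credentials_file = Path(credentials_file).is_file()
    if not has_keys and not has_credentials_file:
        problems.append("no AWS credentials in environment variables or credentials file")
    if not env.get("AWS_REGION"):
        problems.append("AWS_REGION is not set")
    return problems
```

Calling `glue_shell_preflight()` with no arguments inspects the current environment and the default credentials file location.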
Start the Glue PySpark shell:
export AWS_REGION=eu-west-1
cd $HOME/app/aws-glue-libs
./bin/gluepyspark
Run some test code - if this doesn’t yield an error then it’s ready to go:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init('test-job1')
job.commit()
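The same smoke test can be wrapped in a function so that it can be called from a script or a pytest test. This is a sketch rather than an official pattern: the function name is hypothetical, and it returns False instead of raising when the Glue or PySpark libraries cannot be imported (for example when run outside the Glue environment).

```python
def run_glue_smoke_job(job_name="test-job1"):
    """Initialise and commit an empty Glue job.

    Returns True on success and False if awsglue or pyspark is not importable.
    """
    try:
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from pyspark.context import SparkContext
    except ImportError:
        return False

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(job_name)
    job.commit()
    return True
```

Any other error raised while creating the context or the job is left to propagate, since that is exactly what the smoke test is meant to surface.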