Writing the Spark code

To run a custom application, we created a simple Spark application that we will go through step by step.

Dependencies

To run this Spark code we are going to use two external libraries. To do that, we create a requirements.txt file with these lines:

matplotlib
pandas
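
If you want reproducible builds, you can also pin the library versions. The exact versions below are only illustrative; they just need to be compatible with the Python 3.8 of the base image we use later:

matplotlib==3.5.2
pandas==1.4.3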

Data

At its core, a Spark application is a data analysis/transformation program, so we need to have the data ready in an S3 bucket. The data represents the air quality in Sofia, the capital of Bulgaria. You can download it here and put it in an S3 bucket of your choice.
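
For instance, assuming you downloaded the CSV files into a local sofia-air-quality/ folder and named your bucket my-spark-data (both names are hypothetical), the upload could look like this with the AWS CLI:

aws s3 cp ./sofia-air-quality/ s3://my-spark-data/sofia-air-quality/ --recursive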

Code

from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Create (or reuse) the SparkSession that drives the application
spark = SparkSession.builder.appName("sofia-air-quality").getOrCreate()

# Explicit schema: the sensor metadata is required, while the
# measurements themselves may be missing (nullable columns)
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("sensor_id", IntegerType(), False),
    StructField("location", IntegerType(), False),
    StructField("lat", DoubleType(), False),
    StructField("lon", DoubleType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("pressure", DoubleType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("humidity", DoubleType(), True)
])

# Read all the CSV files of the dataset from S3
sensor_data = spark.read.csv('s3a://<bucket-name>/sofia-air-quality/*.csv',
                             schema=schema, header=True,
                             timestampFormat="yyyy-MM-dd'T'HH:mm:ss")

# Expose the DataFrame as a temporary view so we can query it with SQL
sensor_data.createOrReplaceTempView("sensor_data")

# Select one month of temperature and pressure readings for a single
# sensor and bring the (small) result to the driver as a pandas DataFrame
timeseries = spark.sql('''
SELECT timestamp, temperature, pressure
FROM sensor_data
WHERE sensor_id = 5354
AND timestamp BETWEEN '2018-08-01' AND '2018-08-31'
''').toPandas()

# Print the monthly averages and plot both time series
# (in a non-interactive job you would typically save the figures,
# e.g. with matplotlib's savefig, instead of displaying them)
print(timeseries.mean(numeric_only=True))
timeseries.plot(x='timestamp', y='temperature')
timeseries.plot(x='timestamp', y='pressure')

Let’s save this code as main.py, next to requirements.txt, in the same folder.
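
Before building the image, you can optionally smoke-test the script on your machine. This sketch assumes you have installed pyspark locally, replaced <bucket-name> with your bucket, and exported AWS credentials in your environment; the hadoop-aws package is needed for the s3a:// scheme, and its version should match the Hadoop version bundled with your PySpark:

pip install pyspark pandas matplotlib
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.2 main.py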

Dockerfile

On to the next step: we will create the image that will run our application. We will base it on the Spark 3.3.0 image of Ocean for Apache Spark.

# Start from the Ocean for Apache Spark base image
FROM gcr.io/datamechanics/spark:platform-3.3.0-hadoop-3.3.0-java-11-scala-2.12-python-3.8-dm18

ENV PYSPARK_MAJOR_PYTHON_VERSION=3
WORKDIR /opt/application/

# Install the Python dependencies first so that this layer is
# cached across rebuilds of the application code
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy the application code
COPY main.py .

Then we need to build this Docker image and push it to an ECR repository. First, log in to ECR:

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
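
If you do not have an ECR repository yet, you can create one first; the repository name spark-demo below is just an example:

aws ecr create-repository --repository-name spark-demo --region <region>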

Then we need to run these two commands. Note that for docker push to target ECR, the image name must be the full repository URI, i.e. <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>:

docker build -t <your image name> .
docker push <your image name>
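
For example, with a hypothetical account ID, region, and repository name:

docker build -t 123456789012.dkr.ecr.eu-west-1.amazonaws.com/spark-demo:v1 .
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/spark-demo:v1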

Once the Docker image is published, we can move on to the next section to understand and create a configuration template for the application.