MLOps Basics [Week 7]: Container Registry - AWS ECR


What is a Container Registry?

A container registry is a place to store container images. A container image is a file, built up from multiple layers, that can be used to execute an application in a single instance. Hosting all the images in one central location allows users to commit, identify, and pull images when needed.

There are many tools with which we can store container images. The prominent ones are:

  • Docker Hub

  • GitHub Container Registry

  • Google Container Registry (GCR)

  • Azure Container Registry (ACR)

  • AWS Elastic Container Registry (ECR)

and many more...

I will be using AWS ECR.

In this post, I will be going through the following topics:

  • Basics of S3

  • Programmatic access to S3

  • Configuring AWS S3 as remote storage in DVC

  • Basics of ECR

  • Configuring GitHub Actions to use S3, ECR

Basics of S3

What is S3?

Amazon Simple Storage Service (S3) is storage for the internet. It is designed for large-capacity, low-cost storage across multiple geographical regions.

Amazon S3 provides developers and IT teams with secure, durable, and highly scalable object storage.

How is data organized in S3?

Data in S3 is organized in the form of buckets.

  • A Bucket is a logical unit of storage in S3.

  • A Bucket contains objects which contain the data and metadata.

Before adding any data to S3, the user has to create a bucket that will be used to store objects.

Creating bucket

  • Bucket name details

  • Created bucket

  • Uploading a sample file

  • Select any sample file and upload it. After uploading, the sample file shows up in the bucket.
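
As an aside, the same bucket can also be created from the command line once the AWS CLI is set up (covered in the next section); a minimal sketch, assuming the region is us-west-2:

# create the bucket (bucket names are globally unique, so adjust as needed)
aws s3 mb s3://models-dvc --region us-west-2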

Now that we have seen how to create a bucket and upload files, let's see how to access s3 programmatically.

Programmatic access to S3

We can access S3 either via the CLI or from a programming language. Let's look at both ways.

Credentials are required to access any AWS service. There are different ways of configuring credentials. Let's look at a simple way.

  • Go to My Security Credentials
  • Navigate to the Access Keys section and click on the Create New Access Key button.

This will download a CSV file containing the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

Do not share the secrets with others

Set the access key ID and secret values as environment variables:

export AWS_ACCESS_KEY_ID=<ACCESS KEY ID>
export AWS_SECRET_ACCESS_KEY=<ACCESS SECRET>
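
Alternatively, the same keys can be stored in the shared credentials file at ~/.aws/credentials, which both the AWS CLI and boto3 pick up automatically; a minimal sketch:

[default]
aws_access_key_id = <ACCESS KEY ID>
aws_secret_access_key = <ACCESS SECRET>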

Accessing S3 using the CLI

Download the AWS CLI package and install it from here.

The AWS CLI comes with a lot of commands. Check the documentation here.

Let's see what is present in the S3 bucket using the CLI:

aws s3 ls s3://models-dvc/

Output looks like

(base) ravirajas-MacBook-Pro » aws s3 ls s3://models-dvc/
2021-07-24 12:39:21         22 sample.txt

For the full list of available commands, refer to the documentation here
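
A couple of other commands that come in handy; a quick sketch (the local paths here are just placeholders):

# copy a local file into the bucket
aws s3 cp sample.txt s3://models-dvc/sample.txt

# sync a local folder to a prefix in the bucket
aws s3 sync ./models s3://models-dvc/models/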

Accessing S3 using Python

Install the boto3 library, which is the AWS SDK for Python:

pip install boto3

The following code prints the contents of the S3 bucket:

import boto3

# connect to S3 and list every object in the bucket
s3 = boto3.resource('s3')
bucket = s3.Bucket('models-dvc')
for obj in bucket.objects.all():
    print(obj.key)

Output looks like:

sample.txt

For the full list of available methods, refer to the documentation here
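
boto3 can also be used to upload and download objects; a minimal sketch along the same lines (the file names are just placeholders):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('models-dvc')

# upload a local file to the bucket
bucket.upload_file('sample.txt', 'sample.txt')

# download it back to a different local path
bucket.download_file('sample.txt', 'sample_copy.txt')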

Configuring AWS S3 as remote storage in DVC

Let's see how to configure S3 as the remote storage in DVC, to which trained models can be pushed.

Let's create a folder called trained_models in the S3 bucket, which will be used for storing the trained models.

In order to use DVC with S3, make sure you install DVC with S3 support:

pip install "dvc[s3]"

Initialise DVC (if not already initialised) using the following command:

dvc init

Configure the remote storage to point to the S3 location. Copy the S3 URI of the trained_models folder from the S3 console.

Add the S3 folder as the remote storage for models in DVC:

dvc remote add -d model-store s3://models-dvc/trained_models/
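
To double-check that the remote was registered correctly, the configured remotes can be listed; a quick sanity check:

dvc remote list
cat .dvc/config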

Make sure the AWS credentials are set in ENV.

Now let's add the trained model to DVC using the following commands:

cd dvcfiles
dvc add ../models/model.onnx --file trained_model.dvc

Push the model to remote storage

dvc push trained_model.dvc

Once the model is pushed, refreshing the S3 console shows the pushed data under the trained_models folder.
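
To sanity-check the push, the bucket contents can also be listed with the CLI, and the model can be pulled back on any machine with the same remote configured; a sketch:

# DVC stores the pushed data content-addressed under the remote prefix
aws s3 ls s3://models-dvc/trained_models/

# pull the model back using the .dvc file
dvc pull trained_model.dvc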

Basics of ECR

In the previous week, we built the container using CI/CD, but the image was not persisted anywhere for further use. This is where a container registry comes into the picture.


Search for ECR in the AWS console and click on Get Started.


Create a repository with the name mlops-basics when prompted.


Let's build the docker image and push it to ECR.

Before building the docker image, we need to modify the Dockerfile. Till now I have been using Google Drive as the remote storage. That needs to be changed to S3.

The Dockerfile looks like:

FROM huggingface/transformers-pytorch-cpu:latest

COPY ./ /app
WORKDIR /app

ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY


# aws credentials configuration
ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY


# install requirements
RUN pip install "dvc[s3]"   # since s3 is the remote storage
RUN pip install -r requirements_inference.txt

# initialise dvc
RUN dvc init --no-scm

# configuring remote server in dvc
RUN dvc remote add -d model-store s3://models-dvc/trained_models/

RUN cat .dvc/config

# pulling the trained model
RUN dvc pull dvcfiles/trained_model.dvc

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

# running the application
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build the docker image using the command:

docker build --build-arg AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID --build-arg AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY  -t inference:test .
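
Before pushing the image, it can be tested locally; a quick sketch (the container name is just a placeholder, and the app is served on port 8000 as configured in the Dockerfile):

docker run -p 8000:8000 --name inference_container inference:test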

Now let's push the image to ECR.

The commands required to push the image to ECR can be found in the ECR console itself (via the View push commands button).

Following the commands there:

  • Authenticating the docker client to ECR:

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 246113150184.dkr.ecr.us-west-2.amazonaws.com

  • Tagging the image:

docker tag inference:test 246113150184.dkr.ecr.us-west-2.amazonaws.com/mlops-basics:latest

  • Pushing the image:

docker push 246113150184.dkr.ecr.us-west-2.amazonaws.com/mlops-basics:latest
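
To confirm the image landed in the registry, the repository can also be queried with the CLI; a quick check:

aws ecr describe-images --repository-name mlops-basics --region us-west-2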

Configuring GitHub Actions to use S3, ECR

Now let's see how to configure S3 and ECR in GitHub Actions.

We need AWS credentials for fetching the model from S3 and pushing the image to ECR. We can't share this information publicly. Fortunately, GitHub Actions has a way to store this information securely.

It's called Secrets.

  • Go to the settings tab of the repository
  • Go to Secrets section and click on New repository secret
  • Save the following secrets:

    • AWS_ACCESS_KEY_ID

    • AWS_SECRET_ACCESS_KEY

    • AWS_ACCOUNT_ID (this is the account ID of your AWS profile)

These values can be used in GitHub Actions in the following manner:

  • AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  • AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  • AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}

Let's modify the workflow file.

The GitHub Actions Marketplace comes with a lot of predefined actions that are useful for us.

name: Create Docker Container

on: [push]

jobs:
  mlops-container:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./week_7_ecr
    steps:
      - name: Checkout
        uses: actions/checkout@v2
        with:
          ref: ${{ github.ref }}
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - name: Build container
        run: |
          docker build --build-arg AWS_ACCOUNT_ID=${{ secrets.AWS_ACCOUNT_ID }} \
                       --build-arg AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} \
                       --build-arg AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} \
                       --tag mlops-basics .
      - name: Push2ECR
        id: ecr
        uses: jwalton/gh-ecr-push@v1
        with:
          access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          region: us-west-2
          image: mlops-basics:latest

Let's understand what's happening here:

  • The job runs on the ubuntu-latest runner

  • Clones the code and navigates to the week_7_ecr directory

  • Sets the AWS credentials as environment variables using the aws-actions/configure-aws-credentials@v1 action

  • Builds the image and tags it with the mlops-basics tag

  • Pushes the image to ECR using the jwalton/gh-ecr-push@v1 action

The output can be seen in the Actions tab on GitHub and in the ECR repository.

🔚

This concludes the post. We have seen how to automatically build a docker image using GitHub Actions and push it to ECR, and how to use S3 as the remote storage for DVC.

Complete code for this post can also be found here: Github

References