This tutorial uses Docker technology to address a real scientific problem: building custom views of the sky. Starting from scratch (scratch here being a working Docker server), we will collect all the software components necessary to locate and retrieve remote archive data, then build and visualize maps of the sky. Both the software and the data are available off-the-shelf.
Our science task involves building a mosaic image for the galaxy Messier 051 and overlaying it with catalog data.
The first half of the tutorial constructs a Docker image that has all the tools we will need for our task: the Montage software suite and Anaconda Python. Anaconda comes complete with a number of tools, such as Astropy and the SQLite database; we will add a few more.
The second half of the tutorial uses that Docker image to solve our science problem. This includes subtleties like using visualization tools from outside the container and persisting the results after the container exits.
When we are done, we will have a tool that can be generalized for any image data and all on-line catalogs. This can either be wrapped as a command-line utility for ad hoc use or incorporated into large-scale (e.g. cloud) pipelines for massively parallel processing.
Many times, you can find a prebuilt Docker image that does what you want. The rest of the time you will have to build your own. This isn't as scary as it sounds. All docker images are based on a configuration file ("Dockerfile") and a build command. Once it is built, this new image will be kept (unless you actively delete it) and can be reused over and over or even used as the basis for another image.
Technically, you can use names other than "Dockerfile" (and have more than one of them in the same directory), but don't; it gets very tedious, and at least one thing (at Docker Hub) breaks if you do.
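For example (purely a sketch, not something this tutorial uses), once the image we build below exists, a later Dockerfile could use it as its base and simply layer extra tools on top:

# A hypothetical downstream Dockerfile that starts from our image
FROM montage/jupyter:latest

# ... and adds whatever extra packages that project needs
RUN pip install healpy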
Below is the Dockerfile for our project. If you haven't already, log into your assigned AWS machine, create a workspace, and copy the Dockerfile and a Jupyter notebook file we will need later:
mkdir Docker
mkdir Docker/work
cd Docker/work
wget http://montage.ipac.caltech.edu/DockerTutorial/Dockerfile
wget http://montage.ipac.caltech.edu/DockerTutorial/mViewer.ipynb
# All Docker images have to be built 'FROM' something

FROM debian:latest


# This is the way you set environment variables to be used in the running container

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV PATH /opt/conda/bin:$PATH
ENV LD_LIBRARY_PATH /montage/Montage/lib:$LD_LIBRARY_PATH


# 'RUN' directives modify the image in the same way running these commands would
# change an OS instance. Only here we are modifying what will be installed in
# the saved Docker image. We start with a few basic utilities that don't come
# with baseline debian. The reason for all the continuation lines is that every
# RUN creates another layer and uses up more disk space.

RUN apt update --fix-missing && \
    apt install -y wget bzip2 git && \
    apt install -y curl grep sed && \
    apt install -y vim && \
    apt install -y build-essential


# Installing Anaconda gives us a bunch of tools (in particular Jupyter and Astropy)
# that will let us better interact with our data. We then use pip to augment Anaconda
# with astroquery and the Python Montage package.

RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc

RUN pip install astroquery && \
    pip install MontagePy && \
    conda install -y plotly && \
    conda install -y qgrid


# Download and install Montage: While we have a copy of Montage installed for Python
# use, we are going to install it directly in the Docker container as well. This is
# partly to illustrate actually building applications but also because in a large-scale
# processing pipeline (e.g. building terabytes of little mosaics to support HiPS) we
# don't really want to run that through Python and would prefer the command-line Montage.

# Remember, this Dockerfile is not a shell script; there is no shell running. So we
# don't really have any "state" (like a current working directory). The 'WORKDIR'
# directive provides the equivalent, allowing us to do things that require location
# context.

WORKDIR /montage

RUN wget --quiet http://montage.ipac.caltech.edu/download/Montage_07Aug19.tar.gz -O montage.tar.gz

RUN tar -zxf montage.tar.gz

WORKDIR /montage/Montage

RUN make

RUN cp bin/* /usr/local/bin/


# Finally, for this container we want the default application to just be a shell
# so we can be logged-in as soon as we start the container. Other containers often
# default to running a web server, etc.

WORKDIR /work

CMD /bin/bash
Once we have a Dockerfile we can create a docker image using it. This is done with the following command:
docker build --tag montage/jupyter:latest .
The '--tag' option allows you to choose a name for the image. The last argument (.) is the build context; any files referenced in the Dockerfile will be looked for there.
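If the build succeeds, the image is stored by the local Docker server. A quick way to confirm it is there is to list it by the repository name we just chose:

# List locally stored images whose repository matches 'montage/jupyter'
docker images montage/jupyter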
Docker images can be shared in multiple ways. The command above just stores the image in the local running Docker, but it can also be saved to a tarfile and/or uploaded to a registry (most commonly Docker Hub). People also commonly publish their Dockerfiles in repositories like GitHub.
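As a rough sketch of those sharing options (the tarfile name and the Docker Hub account below are placeholders, not something this tutorial requires):

# Save the image to a tarfile, which can be copied to and loaded on another Docker server
docker save -o montage_jupyter.tar montage/jupyter:latest
docker load -i montage_jupyter.tar

# Or push the image to a registry such as Docker Hub
docker login
docker tag montage/jupyter:latest <your-dockerhub-id>/montage-jupyter:latest
docker push <your-dockerhub-id>/montage-jupyter:latest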
We are building ours directly on the Amazon machine instance, so we can immediately use it (create an operational Docker container) by running:
docker run --name montage_jupyter --rm -it -p 8888:8888 \
    -v /home/workshop_usr/Docker/work:/work montage/jupyter:latest
Here is an overview of the ancillary arguments and why we need them:

--name montage_jupyter gives the running container a name we can refer to later (for instance when stopping it from another terminal).

--rm tells Docker to discard the container when it exits; the image itself is kept.

-it attaches an interactive terminal so we can type commands inside the container.

-p 8888:8888 maps port 8888 inside the container to port 8888 on the host, which we will need to reach Jupyter from outside the container (see the sketch after this list).

-v /home/workshop_usr/Docker/work:/work mounts our host workspace at /work inside the container, so anything we put there persists after the container exits.

The last argument is the name of our Docker image.
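We will not actually need Jupyter until Part 2, but as a rough sketch of why the port mapping matters (the exact invocation may differ, and the address placeholder is whatever your AWS machine's public address is):

# Inside the running container: start Jupyter listening on all interfaces,
# so connections forwarded from the host via '-p 8888:8888' are accepted.
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root

# Then, in a browser outside the container, open
#     http://<address-of-your-AWS-machine>:8888
# supply the token Jupyter prints, and open the mViewer.ipynb notebook.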
Once we run this, the container is operational and we see a prompt from the Debian OS inside our Docker container. In Part 2 of this tutorial, we will use this container to do our science project. This is a good break point; we have our image built and can run it as many times as we wish on as many platforms as we have.
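One detail worth noting before we break: because of the --rm flag the container itself is discarded when we leave it, but the image remains, and anything under /work lives in the mounted host directory and so survives. A quick way to see this from the container prompt (a sketch; the files are the ones we downloaded earlier):

# /work is the mounted host directory, so the Dockerfile and mViewer.ipynb
# we downloaded earlier are already visible inside the container.
ls /work

# Leaving the shell stops the container (and --rm removes it) ...
exit

# ... but on the host the image and ~/Docker/work are untouched, so we can
# start a fresh container with the same 'docker run' command at any time.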
Go to Part 2 ...