Distributing confidential Docker images

Here’s another of my pet peeves with Docker: the infrastructure for distributing images is simply Not There Yet. What do we have now? There’s a public image index (I still don’t fully get the distinction between an index and a registry, but it looks like a way for DotCloud to keep some centralized service in the loop even for private images). I can run my own registry, either keeping access completely open (limited only by IP or network interface), or delegating authentication to DotCloud’s central index. Even if I choose to authenticate against the index, there doesn’t seem to be any way to actually limit access to the registry: it looks like anyone with an index account and HTTP(S) access to the registry can download or push images.

There doesn’t seem to be any way in the protocol to authenticate users against anything other than the central index – not even plain HTTP auth. Just to get HTTPS access, I need to put Apache or nginx in front of the registry. And did I mention that there is no way to move complete images between Docker hosts without a registry, not even as a tarball export?

I fully understand that Docker is still in development, and the fact that these are the problems I’m hitting suggests there aren’t many bigger showstoppers, which is actually good. However, this seriously limits Docker’s usefulness in production environments: I either give up control over who can download my images, or I build every image locally on each Docker host – which prevents me from building an image once, testing it, and then running that very same image everywhere.

And the problem with distribution is not only about in-house, confidential software. A lot of open source projects run on Java (off the top of my head: Jenkins, RunDeck, Logstash + Elasticsearch, almost anything from the Apache Software Foundation…). While I support OpenJDK with all my heart, Oracle’s JVM still wins in terms of performance and reliability, and its license doesn’t allow distributing it outside one’s own organization. I may also want to keep my Docker images partially configured: the software is open, but I’d prefer not to publish internal passwords, access keys, or IP addresses.

I hope that in the long run it will be possible to exchange images in different ways (plain old rsync, distribution via BitTorrent, a git-annex network, shared filesystems… I could go on and on). Right now I’ve found only one way, and it doesn’t seem obvious, so I want to share it. Here it is:

Docker’s registry server doesn’t keep any local data; everything it knows lives in its storage backend (an on-disk directory, or an Amazon S3 bucket). This means it’s possible to run the registry locally (bound to 127.0.0.1) and move the access control to the storage backend: you don’t control Docker’s access to the registry, but the registry’s access to the storage. The storage may be a shared filesystem (GlusterFS, or even NFS), an automatically synced directory on disk, or – my preference – a shared S3 bucket. Each Docker host runs its own registry, attached to the same bucket, with a read-only key pair (to make sure it can’t overwrite tags or push images). The central server that builds and tags images is the only one with write access. Images stay confidential, and there’s even crude access control (read-only vs read-write). It’s not the fastest way to distribute images, but it gets the job done until there’s a more direct way to export and import a whole image.
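
In practice, this boils down to something like the following on each host (just a sketch: the registry image name and the SETTINGS_FLAVOR/AWS_* environment variables follow the docker-registry project’s S3 backend as I understand it, and the bucket and key names are made up, so double-check against your registry version):

    # On every Docker host: run the registry bound to localhost only,
    # pointing it at the shared S3 bucket with the *read-only* key pair.
    docker run -d -p 127.0.0.1:5000:5000 \
        -e SETTINGS_FLAVOR=s3 \
        -e AWS_BUCKET=my-confidential-images \
        -e AWS_KEY=AKIA...READONLY \
        -e AWS_SECRET=... \
        registry

    # Only the build server runs its registry with read-write credentials;
    # it tags and pushes images through its own loopback registry:
    docker tag myapp 127.0.0.1:5000/myapp
    docker push 127.0.0.1:5000/myapp

    # Everyone else just pulls through their local, loopback-only registry:
    docker pull 127.0.0.1:5000/myapp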

I hope this approach is useful; have fun with it!


Flat Docker images

Docker seems to be the New Hot Thing these days. It is an application container engine: it lets you pack any Linux software into a self-contained, isolated container image that can be easily distributed and run on different host machines. An image is somewhere in between a well-built Omnibus package and a full-on virtual machine: it’s an LXC container filesystem plus a bit of configuration on top of it (environment variables, default command to run, UID to run it as, TCP/UDP ports to expose, etc.). Once you have the image built or downloaded, you can use it to start one or many containers – live environments that actually run the software.
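
To make the vocabulary concrete, a minimal session looks more or less like this (using the stock ubuntu image as the usual example):

    docker pull ubuntu                  # fetch an image from the index
    docker run -i -t ubuntu /bin/bash   # start a container from that image
    docker ps                           # list running containers
    docker ps -a                        # ...including ones that already exited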

To conserve disk space (and RAM cache), Docker uses AUFS to overlay filesystems. When you start a container from an image, Docker doesn’t copy all the image’s files into the container’s root. Instead, it overlays a new read/write directory on top of a read-only directory holding the image’s filesystem. Any writes the container makes go to its read/write layer; all reads of unchanged files are served from the image’s read-only root. The image’s filesystem root can be shared between all of its running containers, and a started container uses only as much space as it has actually written. This conserves more than disk space: when you have multiple containers started from one image, the operating system can use the same memory cache for the files they share. It also makes starting a container pretty much instantaneous, as Docker doesn’t need to copy the whole root filesystem.
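
The copy-on-write behaviour is easy to observe: docker diff lists only what a given container has written on top of its image (the container ID below is made up):

    # Start a container and change a single file inside it:
    docker run -i -t ubuntu /bin/bash
    # ... inside the container:
    echo hello > /tmp/greeting
    exit

    # Back on the host, the container's read/write layer holds only the diff:
    docker diff 1a2b3c4d
    # C /tmp
    # A /tmp/greeting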

This idea goes a bit further: the image itself is actually a frozen container. To prepare a new image, you just start a container, run one or more commands to install and prepare the application, and then commit the container as a new image. This means that your new image contains only the files that have been added or changed since the image you started it from; that base image contains only the files changed since its own base image, and so on. At the very bottom there’s a base image created from a filesystem archive – the only one that actually contains all of its files. There’s even a cool Dockerfile configuration that lets you describe how to build the application from the base image in one place. And this is where the layering goes a bit too far.
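
In other words, building “by hand” is nothing more than run-and-commit, roughly like this (repository name and setup command are made up):

    # Run the setup in a throwaway container...
    docker run ubuntu /bin/sh -c 'apt-get update && apt-get install -y nginx'

    # ...then freeze that container as a new image:
    docker ps -l -q                     # prints the ID of the last created container
    docker commit 1a2b3c4d myrepo/nginx # commit it under a new name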

[Image: two versions of the same Docker image]

Docker itself is intentionally limited: when you start a container, you’re allowed to run only a single command, and that’s all. Then the container exits, and you can either dispose of it or commit it as a new image. For running containers, that’s fine – it enforces clean design and separation of concerns. When building images, though, every “RUN” line means a new image is committed, which becomes the base for the next “RUN” line, and so on. When building any reasonably complex software from a Dockerfile alone, we always end up with a whole stack of intermediate images that aren’t useful in any way. In fact, they are harmful, as there seems to be a limit on how many directories you can stack with AUFS. It’s reported to be 42 layers, which is not that many considering the size of some Dockerfiles floating around.
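
It’s easy to check how deep such a stack gets: docker history prints one line per ancestor layer (here with the image discussed later in this post):

    # One line per layer in the image's ancestry (plus a header line):
    docker history lopter/collectd-graphite

    # A rough layer count -- with AUFS the reported ceiling is around 42:
    docker history lopter/collectd-graphite | tail -n +2 | wc -l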

It seems that flattening existing images is not a simple task. But can we build images in one go, without stacking dozens of them on top of each other? It turns out to be actually quite easy. Compress all the “RUN” Dockerfile statements into one shell script, and you’re halfway there. Compress all the “ADD” statements into a single one too, and you’re almost done: there’s one image for the ADD, and a second one to RUN the setup script. Besides the unnecessary intermediate image being plain ugly, there’s another issue: the ADD line often copies a big installer or package into the image, only to have it removed by a RUN line after installation. The user still has to download the intermediate image with the huge package file, even though they will never see it, because the child image has deleted it.
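
So a dozen “RUN” lines collapse into one setup script, plus a single “ADD” that brings in the script and whatever payloads it needs; something along these lines (package names and paths are only an illustration):

    #!/bin/sh
    # setup.sh -- everything that used to be separate RUN lines, in a single layer
    set -e

    apt-get update
    apt-get install -y --no-install-recommends curl ca-certificates
    tar -C /opt -xzf /tmp/payload/myapp.tar.gz   # payload brought in by the single ADD
    useradd -r myapp
    rm -rf /var/lib/apt/lists/* /tmp/payload     # cleanup still can't shrink the ADD layer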

It turns out we can use shared volumes instead of the ADD statement. If we use docker run + docker commit manually rather than docker build with a Dockerfile, we can download all the installers on the builder host, expose them to the container as a shared volume, and then commit the container into an image in a single pass.

It’s possible to just write a shell script that does all of that manually. But the Dockerfile format is quite comfortable to use, and there are quite a few container definitions already available that we’d like to keep reusing. And it’s completely feasible to have a script read the Dockerfile syntax and execute it as a one-pass, single-layer build:

  • Compose a single dict out of metadata commands, such as MAINTAINER, CMD, or ENTRYPOINT
  • For each RUN command, add a line to a setup shell script in a shared directory
  • For each ADD command, copy the data into the shared directory, and add a line to the setup script that copies it to its final location
  • Build the image in one pass with a single docker run -v $shared_dir:/.data $from /.data/setup.sh and, if it’s successful, commit it with a single docker commit using the metadata gathered earlier (a rough sketch follows below).
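
A manual version of that single pass might look roughly like this (names, paths, and the JSON passed to docker commit are illustrative; the -run flag for attaching container config to a commit may differ between Docker versions, so check what yours expects):

    #!/bin/sh
    # One-pass build: all the RUN/ADD work happens in a single container,
    # which is then committed exactly once.
    set -e

    from=ubuntu                    # the Dockerfile's FROM image
    target=myrepo/myapp            # tag for the finished image
    shared_dir=$(mktemp -d)

    # What the ADD lines would have copied, plus the collapsed setup script:
    cp myapp.tar.gz setup.sh "$shared_dir/"
    chmod +x "$shared_dir/setup.sh"

    # Run the whole setup in one container, with the payload as a shared volume...
    docker run -v "$shared_dir:/.data" "$from" /.data/setup.sh

    # ...then commit that container once, attaching the metadata gathered from
    # the Dockerfile (MAINTAINER, CMD, ENTRYPOINT, ...):
    container=$(docker ps -l -q)
    docker commit -run '{"Cmd": ["/opt/myapp/bin/run"]}' "$container" "$target"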

As it turns out, 125 lines of Perl is all it takes. The docker-compile.pl script is in the Gist below, and in the image above you can see the inheritance diagram of the lopter/collectd-graphite image in two versions: on the right is the original, created by docker build; on the left is the one created with docker-compile.pl. The flat one takes 299 MB; the combined ancestry of the original is almost 600 MB.

All the script needs is Perl and the JSON CPAN module (on Debian or Ubuntu, you can install it with sudo apt-get install libjson-perl). I hope the idea proves useful – if there’s demand, I can take this proof of concept, polish it, document it, and set up a proper GitHub repo with issue tracking and all. Happy hacking!