Flat Docker images

Docker seems to be the New Hot Thing these days. It is an application container engine – it lets you pack any Linux software into a self-contained, isolated container image that can be easily distributed and run on different host machines. An image is somewhere in between a well-built Omnibus package and a full-on virtual machine: it’s an LXC container filesystem plus a bit of configuration on top of it (environment variables, default command to run, UID to run it as, TCP/UDP ports to expose, etc.). Once you have the image built or downloaded, you can use it to start one or many containers – live environments that actually run the software.

To conserve disk space (and RAM cache), Docker uses AUFS to overlay filesystems. When you start a container from an image, Docker doesn’t copy all the image’s files to the container’s root. It overlays a new read/write directory on top of a read-only directory holding the image’s filesystem. Any writes the container makes go to its read/write directory; all reads of unchanged files are actually served from the image’s read-only root. The image’s filesystem root can be shared between all its running containers, and a started container uses only as much space as it has actually written. This conserves not only disk space: when you have multiple containers started from one image, the operating system can use the same memory cache for all of them for the files they share. This also makes the boot of the container pretty much instantaneous, as Docker doesn’t need to copy the whole root filesystem.
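To make the overlay idea concrete, here is a toy model in Python. It is only an analogy, not how AUFS is implemented: a plain dict stands in for the image's read-only filesystem, and `collections.ChainMap` plays the role of the overlay, with an empty writable dict stacked on top. The paths, contents, and the `start_container` helper are all made up for illustration.

```python
from collections import ChainMap

# Hypothetical read-only image filesystem: path -> file contents.
image = {"/etc/hostname": "base", "/bin/app": "binary-v1"}

def start_container(image_files):
    """Return a container view: an empty writable layer over the shared image."""
    # ChainMap looks keys up in the first mapping, then falls through to the
    # next one; writes always land in the first mapping only.
    return ChainMap({}, image_files)

a = start_container(image)
b = start_container(image)

# The write goes to container a's own top layer; the image is untouched.
a["/etc/hostname"] = "container-a"

print(a["/etc/hostname"])   # read served from a's writable layer
print(b["/etc/hostname"])   # read falls through to the shared image
print(len(a.maps[0]), len(b.maps[0]))  # entries each container has written
```

Starting a container here costs nothing beyond an empty dict, which mirrors why a container started from an image boots without copying the root filesystem.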

This idea goes a bit further: the image itself is actually a frozen container. To prepare a new image, you just start a container, run one or more commands to install and prepare the application, and then commit the container as a new image. This means that your new image has only the files that have been added or changed since the image you started it from; and that base image has only files that have changed since its base image, and so on. At the very bottom there’s a base image that has been created from a filesystem archive – the only one that actually contains all of its files. There’s even a cool Dockerfile configuration that lets you describe how to build the application from the base image in one place. And this is where the layering goes a bit too far.

Figure: Two versions of the same Docker image

Docker itself is intentionally limited: when you start a container, you’re allowed to run only a single command, and that’s all. Then the container exits, and you can either dispose of it, or commit it as a new image. For running containers, it’s fine – it enforces clean design and separation of concerns. When building images, though, every “RUN” line means a new image is committed, which is a base for the next “RUN” line, and so on. When building any reasonably complex software based only on a Dockerfile, we always end up with a whole stack of intermediate images that aren’t useful in any way. In fact, they are harmful, as there seems to be a limit on how many directories you can stack with AUFS. It’s reported to be 42 layers, which is not much considering the size of some Dockerfiles floating around.

It seems that flattening existing images is not a simple task. Can we build images in one go, without stacking dozens of them on top of each other? It seems it’s actually quite easy. You just compress all the “RUN” Dockerfile statements into one shell script, and you’re halfway there. If you compress all the “ADD” statements into a single one too, you’re almost there: there’s one image for ADD, and a second one to RUN the setup script. Besides an unnecessary intermediate image being plain ugly, we have another issue: the ADD line often copies a big installer or package into an image, only to have it removed by a RUN line after installation. The user still has to download the intermediate image with the huge package file, and then never sees that file because the child image has deleted it.
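The cost of that deleted installer is easy to see with a little arithmetic. The sketch below is a toy model with invented numbers (a 200 MB installer, a 50 MB installed application); layers are dicts mapping a path to its size in MB, and `None` marks a file deleted in that layer. It does not read real Docker images.

```python
# Hypothetical sizes in MB; None marks a file deleted in that layer.
add_layer = {"/tmp/installer.bin": 200}
run_layer = {"/opt/app": 50, "/tmp/installer.bin": None}

def download_size(layers):
    """Every layer is shipped whole, so files deleted later still count."""
    return sum(size
               for layer in layers
               for size in layer.values()
               if size is not None)

def visible_size(layers):
    """Size of the filesystem a container actually sees after overlaying."""
    merged = {}
    for layer in layers:          # later layers override earlier ones
        merged.update(layer)
    return sum(size for size in merged.values() if size is not None)

layered = [add_layer, run_layer]   # one image for ADD, one for RUN
flat = [{"/opt/app": 50}]          # the same result built in one pass

print(download_size(layered), visible_size(layered))
print(download_size(flat), visible_size(flat))
```

Both variants show the same files to the container, but in the layered one the user still downloads the installer that no container will ever see.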

It turns out we can use shared volumes instead of the ADD statement. If we use docker run + docker commit manually rather than docker build with a Dockerfile, we can download all the installers on the builder host, expose them to the container as a shared volume, and then commit the container into an image in a single pass.
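As a rough sketch of that manual flow, the Python helper below only composes the two command lines and prints them; it does not run Docker. The function name, the container name `flat-build`, the host directory, and the image names are placeholders, and the `--name` option assumes a reasonably current Docker CLI. The `/.data` mount point and `setup.sh` name mirror the command line used later in this post.

```python
import shlex

def one_pass_build_commands(shared_dir, base_image, new_image,
                            container="flat-build", mount="/.data"):
    """Compose argv lists for a run-then-commit build (nothing is executed)."""
    # Mount the host directory with installers and the setup script into the
    # container, and run the setup script as the container's single command.
    run = ["docker", "run", "--name", container,
           "-v", f"{shared_dir}:{mount}",
           base_image, f"{mount}/setup.sh"]
    # Freeze the finished container as the new image.
    commit = ["docker", "commit", container, new_image]
    return run, commit

run_cmd, commit_cmd = one_pass_build_commands(
    "/tmp/build", "ubuntu", "example/flat-app")
print(shlex.join(run_cmd))
print(shlex.join(commit_cmd))
```

Each list could be handed to `subprocess.run(..., check=True)`, running the commit only if the first command succeeds. Because the installers live in the mounted volume rather than in the container's filesystem, they never end up in the committed image.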

It’s possible to just write a shell script which does all of that manually. But the Dockerfile format is quite comfortable to use, and there are quite a few container definitions already available that we wouldn’t be able to reuse. And it’s completely feasible to have a script read the Dockerfile syntax and execute it as a one-pass single-layer build:

  • Compose a single dict out of metadata commands, such as MAINTAINER, CMD, or ENTRYPOINT
  • For each RUN command, add a line to a setup shell script in a shared directory
  • For each ADD command, copy the data inside the shared directory, and add a line to the setup script that copies it to its final location
  • Build the image in one pass, with a single docker run -v $shared_dir:/.data $from /.data/setup.sh, and if it’s successful, commit the image with a single docker commit using the metadata gathered earlier.
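The first three steps above can be sketched in a few lines of Python. This is an illustration of the idea, not a port of docker-compile.pl: the function name `plan_flat_build`, the staged file names (`add0`, `add1`, ...), and the sample Dockerfile are invented here, only MAINTAINER, CMD, and ENTRYPOINT are treated as metadata, and backslash line continuations are not handled.

```python
import shlex

METADATA = {"MAINTAINER", "CMD", "ENTRYPOINT"}

def plan_flat_build(dockerfile_text, data_mount="/.data"):
    """Turn Dockerfile text into (base image, metadata, setup script, copies)."""
    base = None
    metadata = {}
    script = ["#!/bin/sh", "set -e"]   # stop at the first failing command
    copies = []                        # (source on host, name in shared dir)
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        args = parts[1].strip() if len(parts) > 1 else ""
        if instruction == "FROM":
            base = args
        elif instruction == "RUN":
            script.append(args)        # one script line per RUN
        elif instruction == "ADD":
            src, dest = args.split(None, 1)
            staged = f"add{len(copies)}"
            copies.append((src, staged))
            # Copy from the shared volume to the final location at build time.
            script.append("cp -a {} {}".format(
                shlex.quote(f"{data_mount}/{staged}"), shlex.quote(dest)))
        elif instruction in METADATA:
            metadata[instruction] = args
        else:
            raise ValueError(f"unsupported instruction: {instruction}")
    return base, metadata, "\n".join(script) + "\n", copies

example = """\
FROM ubuntu
MAINTAINER someone
ADD installer.deb /opt/installer.deb
RUN dpkg -i /opt/installer.deb
RUN rm /opt/installer.deb
CMD ["/usr/bin/app"]
"""

base, metadata, script, copies = plan_flat_build(example)
print(script)
```

The caller would then copy each source in `copies` into the shared directory under its staged name, write `script` there as setup.sh, and perform the single run and commit described in the last step.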

As it turns out, 125 lines of Perl is all it takes. The docker-compile.pl script is in the Gist below, and in the image to the right you can see the inheritance diagram of the lopter/collectd-graphite image in two versions: on the right is the original, created by docker build; on the left is the one created with docker-compile.pl. The flat one takes 299 MB; the combined ancestry of the original is almost 600 MB.

All the script needs is Perl and the JSON CPAN module (on Debian or Ubuntu, you can install it with sudo apt-get install libjson-perl). I hope the idea will prove useful – if there’s demand, I can take this proof of concept, polish it, document it, and set up a proper GitHub repo with issue tracking and all. Happy hacking!