Containerization technologies like Docker are designed to solve challenges associated with installing and running complex software such as bioinformatics pipelines and web servers.
Containerization is a process in which all of the components needed to run a program or system, such as a modern web-based content management system (CMS), are packaged together into a file called an image. When the image is instantiated (copied and set running), the resulting instance, the container, runs the software just as it would in the original computing system. Containers allow software to be developed in one environment and then run in many others, provided that an application, the container engine, is installed on the target computers.
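The image-then-container workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the setup used at the camp: the image name `hello-bio`, the base image `python:3.11-slim`, and the helper functions are hypothetical. The script assembles a minimal Dockerfile and the two Docker CLI commands (`docker build` to package an image, `docker run` to instantiate it as a container) and prints them without executing anything.

```python
# Hypothetical image name and base image, chosen only for illustration.
IMAGE_NAME = "hello-bio"
BASE_IMAGE = "python:3.11-slim"


def dockerfile_text(base_image: str = BASE_IMAGE) -> str:
    """Return a minimal Dockerfile: a base image plus a default command."""
    return "\n".join([
        f"FROM {base_image}",
        'CMD ["python", "-c", "print(\'hello from a container\')"]',
        "",
    ])


def build_command(image: str = IMAGE_NAME, context: str = ".") -> list[str]:
    """Command that packages the build context (with its Dockerfile) into an image."""
    return ["docker", "build", "-t", image, context]


def run_command(image: str = IMAGE_NAME) -> list[str]:
    """Command that instantiates the image as a container, removed on exit."""
    return ["docker", "run", "--rm", image]


if __name__ == "__main__":
    # Print the pieces only; nothing is built or run here.
    print(dockerfile_text())
    print(" ".join(build_command()))
    print(" ".join(run_command()))
```

Saving the printed Dockerfile text to a file named `Dockerfile` and running the two printed commands in that directory should, on a machine with a working container engine, build the image and then start a container from it.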
Virtualization runs a complete computer within a computer. Containers, by contrast, share memory and other resources with the host computer, so container images are generally smaller than virtual machine images and consume fewer resources. Hence, containers make software easier to replicate and use. In a field like bioinformatics, where complex software systems are the norm, containers are important and their use is growing quickly.
In a previous post I discussed how I attended a CyVerse Container Camp hosted at the University of Arizona, Tucson, AZ, as part of my efforts to design the advanced bioinformatics portion of Shoreline Community College's Immuno-biotechnology certificate. Through this course I learned that containers open new possibilities for using bioinformatics in biology classes because they can hide the systems administration and other operational details that make bioinformatics software hard to work with and limit its adoption.
A very brief history
The first day of the camp focused on Docker. Docker turns five this year (March 2018). While it may appear that containers have been around for only a short time, the ideas behind containers and their early implementations began 40 years ago with Unix V7 and chroot. As a first method, chroot provided a way to partition file systems so that software could be run as separate instances to improve security and access. Since then, chroot has evolved and is still in use. At Geospiza we used chroot to partition iFinch instances for our customers (iFinch [ca. 1999] became Finch Lab, which became GeneSifter Lab Edition). Using chroot we could give individual labs their own iFinch, but do so on a single computer system.
From about 2000 and over the next decade, containerization continued to develop, but widespread use was limited. Systems could be run as individual instances, but these systems had to reside in a single host environment or operating system; they lacked a wide degree of portability. As containers advanced to overcome these problems, Docker broke free by developing a complete ecosystem for container management. This ecosystem includes an open-source engine to run containers on many operating systems, methods and tools to package software into containers, and a service to make images available to different communities via the web and the Docker Engine. Through these efforts Docker moved containers from early adopters to mainstream development communities.
According to our class presenters, the Docker community has over 3300 project contributors and 14M hosts. Docker itself claims 450 million downloads. For perspective, Linux claims 1681 developers for its Linux 4.7 Kernel, and over 15,000 individuals have contributed to Linux since 2005.
Docker and bioinformatics
As noted, Docker's popularity is growing in bioinformatics. A Google search for “Docker and bioinformatics” returns over 100,000 links to opinions in blogs, papers, GitHub, Docker Hub, and other sites. Google's recommended searches related to Docker and bioinformatics include biodocker and biocontainers. Biodocker is another bio-thing: like bioperl, biopython, biojava, and bio-yournamehere, there is a biodocker. This biodocker, however, is really a name for BioContainers; that is, biodocker.org redirects to BioContainers.pro. BioContainers is an effort to bring standardization and verification to bioinformatics containers.
Many of the most popular, difficult-to-install programs, such as BLAST, are being containerized (a search for "docker and BLAST" yields 236,000 links). Galaxy, a popular web-based bioinformatics system that is good for learning bioinformatics and for low-throughput exploratory work, has many Docker containers as well. The BioContainers site links to a GitHub repository that has 69 BioContainers. Many of these are also on Quay.io, another Docker distribution site. A “bio” search on Docker Hub returns 4540 hits. Scanning a few pages indicates that many are likely bioinformatics related.
Docker and the Law of Leaky Abstractions
Docker will change the world ... maybe. While clearly powerful and enabling, the magic of Docker can also be an overpromise. Back in the early days of the modern Internet (ca. 2000), Joel Spolsky wrote a series of blogs on software development. One of my favorites, “The Law of Leaky Abstractions,” speaks to a core issue in software. It is based on an understanding that all programs, all systems, and all computer magic are built on APIs (application programming interfaces). APIs abstract computing details through layers of software functionality that range from low-level, language-specific libraries to the REST APIs of the modern Internet. As abstractions increase in diversity and complexity, they make it possible to develop increasingly powerful software rapidly. Indeed, abstractions are the key to achieving the Technological Singularity. The Law of Leaky Abstractions also recognizes that APIs are human constructions. Thus, they are based on transient assumptions that may or may not be correct, and they contain bugs that can interfere with operation.
Docker is no exception. As an example, in the class we had an assignment to build a Docker container. We had been learning how easy it is; now was the time to try. After some trial and error I built mine. First abstraction leak: nothing is ever as easy as it seems. Second abstraction leak: something went wrong on my laptop, and the errors pointed to a solution, but the solution did not address the real problem.
When I ran my container, it reported errors indicating that the language environment variables (ENV, the incomprehensible UTF-8, and other syntactically obscure settings) were not correctly set. No one else was having this problem. I also noticed that containers with IDs but lacking names were building up. Interestingly, I could run this container with Singularity (another story) without a problem, so the issue was likely not the container, but the engine running it.
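For readers who hit similar locale complaints, a small check along the following lines can show what the language environment looks like from inside a container. It is a hedged sketch, not a diagnosis of the problem described above: the function names are hypothetical, and it looks only at the standard POSIX variables `LC_ALL`, `LC_CTYPE`, and `LANG`, in that order of precedence.

```python
import os
from typing import Mapping, Optional

# Variables consulted in POSIX precedence order: LC_ALL overrides LC_CTYPE,
# which overrides LANG. An empty value is treated as unset.
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def effective_ctype_locale(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty locale setting, or None if none is set."""
    for name in LOCALE_VARS:
        value = env.get(name, "")
        if value:
            return value
    return None


def is_utf8_locale(env: Mapping[str, str]) -> bool:
    """True when the effective character-type locale names a UTF-8 codeset."""
    value = effective_ctype_locale(env)
    if value is None:
        return False
    # Locale names look like "en_US.UTF-8" or "C.utf8", optionally followed
    # by an "@modifier"; compare only the codeset part.
    codeset = value.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    print("effective locale:", effective_ctype_locale(os.environ))
    print("UTF-8 configured:", is_utf8_locale(os.environ))
```

When no UTF-8 locale is configured, one remedy often suggested is a Dockerfile line such as `ENV LANG=C.UTF-8`, assuming the base image provides that locale. Whether that would have helped in the case described here is unknown.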
After I got home, I tried to repeat the exercise while searching, fruitlessly, for solutions to the ENV problem; none made sense. I tried one more thing. I performed a “hard” reset on my environment by deleting my Docker images and starting over. This worked! The container ran, and I went on to test other kinds of Docker containers, such as Galaxy. As a side note, the Docker Engine has a reset command which does something similar in that it will clear the containers and images. It’s kind of like Apple’s "did you zap your PRAM or reinstall the OS?" suggestion when things are going very badly and no one is sure what to do.
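A "hard" reset of this kind can be approximated from the command line with `docker system prune`. The sketch below is an assumption about how one might script it, not a record of the commands actually used here. The helper names are hypothetical, and by default the script only prints the command rather than running it, because the operation deletes stopped containers, unused images, unused networks, and build cache.

```python
import shutil
import subprocess


def reset_commands(include_volumes: bool = False) -> list[list[str]]:
    """Docker CLI commands approximating a 'hard' reset of local Docker state."""
    prune = ["docker", "system", "prune", "--all", "--force"]
    if include_volumes:
        # Also delete unused volumes; any data stored in them is lost.
        prune.append("--volumes")
    return [prune]


def hard_reset(dry_run: bool = True, include_volumes: bool = False) -> list[str]:
    """Print each command; run it only if dry_run is False and docker is on PATH."""
    shown = []
    for cmd in reset_commands(include_volumes):
        line = " ".join(cmd)
        print(line)
        shown.append(line)
        if not dry_run and shutil.which("docker"):
            subprocess.run(cmd, check=True)
    return shown


if __name__ == "__main__":
    hard_reset()  # dry run: shows what would be executed
```

Running containers and the images they use are left alone by this command, so it is somewhat gentler than a full factory reset. Calling `hard_reset(dry_run=False)` performs the deletion.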
The point of the above story is that the Docker Engine is the abstraction layer between the containers and the operating system (see General Docker architecture). The DE (an abstraction for Docker Engine) does many things: it arbitrates commands issued from the host operating system (OS) to get and build images, launch containers, and enable container capabilities that interact with the OS. For containers to run on a wide range of operating systems, there must be a DE for each operating system, and each DE needs to interface with that OS's APIs. This is Docker's strength. It is also Docker's Achilles' heel ... more to follow.
 A brief history of containers and Docker: https://blog.aquasec.com/a-brief-history-of-containers-from-1970s-chroot-to-docker-2016
 A terse description of chroot: https://en.wikipedia.org/wiki/Chroot
 To learn more about high-tech marketing, I recommend Geoff Moore's "Crossing the Chasm" and "Inside the Tornado."
 The Linux Kernel Development Report 2017: https://www.linuxfoundation.org/blog/2017-linux-kernel-report-highlights...
Acknowledgment of Support: This material is based in part on work supported by the National Science Foundation under Grant Number DUE 1700441. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.