In our previous blog post, we introduced containers, how they work, and why so many applications are being moved to them. While we touched on the most fundamental difference between virtual machines and containers, state, we did not dive into all of its ramifications.
Stateful vs. Stateless
Containers are stateless, which means changes made to the container itself are lost when the container is stopped or spun up on another host. The impact of this simple fact is huge. In this blog post, we’ll discuss what it means for:
- Application architecture
- Persistent data
Application Architecture
Because a container is stateless, applications must be explicitly designed to run in containers to prevent data loss and configuration issues.
A container disk image is a layer cake of read-only layers, where each layer is created from a configuration file containing installation and configuration commands. Running containers do have a read-write top layer, but that layer is not persistent: it is lost if the container is terminated or the host suffers a catastrophic failure. Think of the top writable layer as ephemeral; it should only hold temporary files while the container is running, never persistent data.
Container images are also immutable, so application binaries and application configuration must be stored differently. In virtual machines, all files are stored in a single disk image. With containers, each layer adds a specific part of the application stack. For example, intermediate layers may contain the dependencies required to run the application, such as a web server, Java, or binary tools. These dependency layers may be built automatically from the latest build configuration files provided by IT Ops. The DevOps team then builds the latest application release image using a CI/CD pipeline.
Application configuration changes significantly between environments such as testing, acceptance, staging, and production. It can also change across cloud availability zones or between application releases as functionality and code evolve. Storing this information inside the container image means creating a separate image for each environment, negating many of the advantages of using containers.
Kubernetes stores configuration as key/value pairs in a ConfigMap, which containers consume as environment variables, configuration files, or command-line arguments. This is a radically different approach, and it has an impact on data protection, but more on that later.
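As a minimal sketch of what this looks like in practice (the names app-config, DB_HOST, LOG_LEVEL, and the image tag below are illustrative assumptions, not taken from any real application), a ConfigMap and a Pod consuming it as environment variables might look like this:

```yaml
# A ConfigMap holding environment-specific settings as key/value pairs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config              # hypothetical name
data:
  DB_HOST: "db.staging.local"   # differs per environment (testing, staging, production)
  LOG_LEVEL: "debug"
---
# A Pod that consumes the ConfigMap as environment variables,
# so the same immutable image can run unchanged in every environment.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      envFrom:
        - configMapRef:
            name: app-config
```

Because only the ConfigMap changes between environments, the same image can be promoted from testing to production without a rebuild. First, let’s look at persistent application data.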
Persistent Data
Persistent application data is stored in one of three forms: block, file, or object. Enterprise storage environments often have storage appliances that serve block-based storage to the virtualization platform. In a VMware environment, for example, a LUN formatted with the VMFS filesystem stores VMDK files, the block-based virtual disks that VMs use to store data.
If containers are used properly, the process is similar. One of the major advantages of using Kubernetes is its more mature functionality for presenting storage to containers: the PersistentVolume (PV). A PV is a piece of storage, such as a LUN, presented to a container, which can use it like a regular hard disk. A PV stores persistent application data because it is mutable and its changes are not lost when the container is stopped, destroyed, or spun up on another container host.
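As a hedged sketch (the claim name app-data, the requested size, the mount path, and the image tag are illustrative assumptions), an application typically requests a PV through a PersistentVolumeClaim and mounts it into a container:

```yaml
# A PersistentVolumeClaim requesting persistent storage from the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce             # mountable read-write by a single node
  resources:
    requests:
      storage: 10Gi
---
# A Pod mounting the claimed volume; data written to /var/lib/app
# survives the container being stopped or rescheduled on another host.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      volumeMounts:
        - name: data
          mountPath: /var/lib/app
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data
```

The PV bound to the claim is what actually holds the data; the claim simply decouples the application from the specific storage backend behind it.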
So how do you protect these Persistent Volumes?
Container Data Protection
Simply put, data protection must be specifically designed to protect persistent data for containerized applications.
The container images themselves are immutable, and therefore less critical to protect (although a case could be made for protecting the latest version of each image for faster restores). The more important assets to protect are the configuration files and build pipelines that produce those images; they are the factories. That means protecting the build server, the repository where built images and other binaries and artifacts are stored, and the code repositories at the source of each pipeline. These elements may be virtual machines running in a datacenter, in which case they warrant protection by existing data protection solutions.
The configuration items need protection as well. They can be protected at the source in the code repository, or captured from the live environment (across testing, acceptance, staging, and production) for quicker disaster recovery.