Categories:	HowTos
Tags:	Container Docker Kubernetes

For some time now, a large part of the IT landscape has been talking only about “containers”, “microservices” and “Kubernetes.”

But what exactly are containers and what technical basis are they based on?

General

Simply put, a container is an isolated runtime environment for processes. There are various areas that can be separated – the most important being processes (pid), network (net), volumes / hard disks (mnt) and user / group IDs (user).

The technology behind this is called “namespaces” and was first implemented in the Linux Kernel since version 2.4.19 (2002) and later expanded, but only since version 3.8 (2013) in userspace, i.e. meaningfully usable for users. In addition, the cgroups technology plays a major role here. This makes it possible to provide the separate areas with resources such as CPU and RAM, or to define them.

A well-known and early implementation of these features is lxc (linux containers), which is still being developed today and implements these features close to the system.

Linux Namespaces

A namespace is a way to divide resources and objects into logical groups. You could also describe it as a system context in which a process is started. It is not a problem to create your own namespaces within a namespace for newly started processes.

An example from daily practice:

When a Linux host starts, an instance is created for each namespace type. The init process with PID 1 (usually systemd today) is then assigned to the instances accordingly. This is transparent and only of limited relevance for most users. This is because all system resources are available to these namespaces and new resources are initially assigned to them.

To view a list of the namespaces currently running on the system, there is the tool lsns.

In the following example, we see the initially created namespaces and the assignment of the init process.

[root@buildah ~]# lsns -p1
        NS TYPE   NPROCS PID USER COMMAND
4026531835 cgroup     96   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
4026531836 pid        96   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
4026531837 user       95   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
4026531838 uts        96   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
4026531839 ipc        96   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
4026531840 mnt        90   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
4026531992 net        96   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18

A process can only ever be assigned to one namespace per type. For example, the process with PID 1 from the example above cannot be assigned an additional pid namespace.

The different types have no interactions or dependencies with each other. For example, you can assign a new process (e.g. a shell) only its own net-namespace.

To create a new namespace as a user, there is the tool unshare. Using parameters, it is possible to specify which namespace types should be created for the process.

The following is an example of how a container-like environment (in which all possible areas are separated from the host system) can be created manually.

To do this, we start a bash shell with the parameters shown as a normal user without root permissions.

[podmanager@buildah ~]$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash

[root@buildah ~]# id
uid=0(root) gid=0(root) Gruppen=0(root) Kontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

[root@buildah ~]# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

[root@buildah ~]# lsns
        NS TYPE   NPROCS   PID USER COMMAND
4026531835 cgroup      3   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
4026531836 pid         1   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
4026532192 user        3   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
4026532193 mnt         3   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
4026532194 uts         3   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
4026532196 ipc         3   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
4026532198 pid         2   953 root /bin/bash
4026532200 net         3   952 root unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash

As you can see, the process is now in an encapsulated area with its own IDs and network area. However, the entire hard disk configuration was also transferred to the new area, including the /proc folder, in which the processes of the host system are listed.

[root@buildah ~]# ps -ef f  | head
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
nobody       2     0  0 14:22 ?        S      0:00 [kthreadd]
nobody       3     2  0 14:22 ?        I<     0:00  \_ [rcu_gp]
nobody       4     2  0 14:22 ?        I<     0:00  \_ [rcu_par_gp]
nobody       6     2  0 14:22 ?        I<     0:00  \_ [kworker/0:0H-kblockd]
nobody       8     2  0 14:22 ?        I<     0:00  \_ [mm_percpu_wq]
nobody       9     2  0 14:22 ?        S      0:00  \_ [ksoftirqd/0]
nobody      10     2  0 14:22 ?        I      0:00  \_ [rcu_sched]
nobody      11     2  0 14:22 ?        S      0:00  \_ [migration/0]
nobody      12     2  0 14:22 ?        S      0:00  \_ [watchdog/0]

To correct this, a mount -t proc proc /proc is required. This overlays the /proc of the host system, meaning that only the processes of the new environment are visible.

[root@buildah ~]# mount -t proc proc /proc
[root@buildah ~]# ps -ef f 
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root         1     0  0 15:05 pts/1    S      0:00 /bin/bash
root        27     1  0 15:11 pts/1    R+     0:00 ps -ef f

To leave the environment, simply type exit, or the key combination STRG+D.

[root@buildah ~]# exit
exit
[podmanager@buildah ~]$

cgroups

Control Groups (cgroups for short) are not a direct namespace, but allow processes to be grouped in a type of namespace and the available resources such as CPU and / or RAM to be limited or prioritized.

There is now an updated version of cgroups (cgroupsV2) in the kernel. However, this is probably only used productively by Fedora >= 31, as there are some incompatibilities with Docker here, but not with Podman.

However, the setup is somewhat more complex and will therefore not be explained here, but will be the subject of a separate article.

Further information can be found for those interested here (cgroupsv1) and here (cgroupsv2)

The file system of a container

Each container contains all the components necessary for the operation of the binaries, such as libraries and binaries.
The only dependency on the host system is generally that the applications must be able to run on the kernel of the host system.

However, the file system is not a separate hard disk or similar, but merely an archive that contains a directory tree.

This archive is then unpacked into a folder at the latest when a container is started and used as a new file system on this folder by means of an mnt namespace and chroot. The chroot changes the entry point for the file system to which the user has access. For example, /var/lib/docker/container1/dateisystem on the host becomes the new / within the container.

Here is an example with the separated environment from the previous section.

First, we export the file system of the Postgres container as a tar archive and then unpack it into a subfolder.

[podmanager@buildah ~]$ podman export 3b62694339c6 -o postgres_container.tar
[podmanager@buildah ~]$ ls -l postgres_container.tar 
-rw-r--r--. 1 podmanager podmanager 313597440 31. Mär 15:31 postgres_container.tar
[podmanager@buildah ~]$ mkdir postgres_root
[podmanager@buildah ~]$ tar -xf postgres_container.tar -C postgres_root/
[podmanager@buildah ~]$ ls -l postgres_root/
insgesamt 12
drwxr-xr-x.  2 podmanager podmanager 4096  3. Mär 01:27 bin
drwxr-xr-x.  2 podmanager podmanager    6  1. Feb 18:09 boot
drwxr-xr-x.  2 podmanager podmanager    6 24. Feb 01:00 dev
drwxr-xr-x.  2 podmanager podmanager    6  3. Mär 01:27 docker-entrypoint-initdb.d
lrwxrwxrwx.  1 podmanager podmanager   34  4. Mär 18:35 docker-entrypoint.sh -> usr/local/bin/docker-entrypoint.sh
drwxr-xr-x. 37 podmanager podmanager 4096 31. Mär 14:24 etc
drwxr-xr-x.  2 podmanager podmanager    6  1. Feb 18:09 home
drwxr-xr-x.  8 podmanager podmanager   96 26. Feb 01:54 lib
drwxr-xr-x.  2 podmanager podmanager   34 24. Feb 01:00 lib64
drwxr-xr-x.  2 podmanager podmanager    6 24. Feb 01:00 media
drwxr-xr-x.  2 podmanager podmanager    6 24. Feb 01:00 mnt
drwxr-xr-x.  2 podmanager podmanager    6 24. Feb 01:00 opt
drwxr-xr-x.  2 podmanager podmanager    6  1. Feb 18:09 proc
drwx------.  2 podmanager podmanager   76 31. Mär 14:44 root
drwxr-xr-x.  5 podmanager podmanager   84 31. Mär 14:24 run
drwxr-xr-x.  2 podmanager podmanager 4096  3. Mär 01:27 sbin
drwxr-xr-x.  2 podmanager podmanager    6 24. Feb 01:00 srv
drwxr-xr-x.  2 podmanager podmanager    6  1. Feb 18:09 sys
drwxrwxr-x.  2 podmanager podmanager    6  3. Mär 01:27 tmp
drwxr-xr-x. 10 podmanager podmanager  105 24. Feb 01:00 usr
drwxr-xr-x. 11 podmanager podmanager  139 24. Feb 01:00 var

Now we create a shell with its own namespaces again and execute the chroot.

[podmanager@buildah ~]$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash

[root@buildah ~]# cat /etc/redhat-release 
CentOS Linux release 8.1.1911 (Core)

[root@buildah ~]# chroot postgres_root

root@buildah:/var/lib/postgresql/data# /bin/cat /etc/issue
Debian GNU/Linux 10 \n \l

Finally, the /proc file system must be corrected, as mentioned.
Once this is done, we have the same working environment that we would have in a container.

root@buildah:/var/lib/postgresql/data# /bin/mount -t proc proc /proc

root@buildah:/var/lib/postgresql/data# /bin/ps -ef f 
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root         1     0  0 13:34 ?        S      0:00 /bin/bash
root        27     1  0 13:35 ?        S      0:00 /bin/bash -i
root        54    27  0 13:43 ?        R+     0:00  \_ /bin/ps -ef f

Conclusion

This is of course only a quick overview of the technical framework of container technology, which is based on known features of the Linux kernel.
Docker and also Podman use these features, but offer many more functions and, above all, convenience functions for handling them.

At the latest when using container orchestration tools such as Kubernetes or okd, several layers of complexity are also added.

If you have any questions about the use of containers, please do not hesitate to contact us. Contact us!

Categories:	HowTos
Tags:	Container Docker Kubernetes

About the author

Danilo Endesfelder

Berater

about the person

Danilo ist seit 2016 Berater bei der credativ GmbH. Sein fachlicher Fokus liegt bei Containertechnologien wie Kubernetes, Podman, Docker und deren Ökosystem. Außerdem hat er Erfahrung mit Projekten und Schulungen im Bereich RDBMS (MySQL/Mariadb und PostgreSQL®). Seit 2015 ist er ebenfalls im Organisationsteam der deutschen PostgreSQL® Konferenz PGConf.DE.

View posts

Share this post: