Chasing Kubernetes: Writing a container runtime from scratch

I have a longer plan. I want to build a Kubernetes scheduler from scratch, and a lot of things need to fall into place before I get there.

The first is understanding what a container actually is. Not the concept — I know the concept. I wanted the feel of it, the actual moving parts. So I built a small one myself, just something that can run an isolated process on the same machine. I called it forker.

What is a namespace

When you run a program, Linux gives it a process ID and lets it see everything on the machine: other processes, the filesystem, the network. Everything is shared by default.

A namespace changes that. It tells the kernel to give a process its own isolated view of something. Linux has a few kinds:

UTS — the process gets its own hostname
PID — the process thinks it is PID 1 and cannot see anything outside
Mount — the process gets its own filesystem view
Network — the process gets its own network stack
IPC — the process gets its own communication channels

These are not Docker features. They are Linux kernel features that have existed since the mid-2000s. Docker just wrapped them so you never had to think about them.

Creating the namespace

In Go you just pass some flags when spawning a child process.

runtime/run.go

childCmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS |
                syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS  |
                syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWNET,
}

Each flag tells the kernel to give that child process a fresh, isolated version of that resource. Same Linux kernel underneath, but the process sees a completely different world.

The tricky part: parent and child

You cannot set up a namespace from inside it. The parent process has to configure the isolation before the child starts. But some setup work — like mounting filesystems and bringing up the network — can only happen from inside the namespace after it exists.

The solution is to run the same binary twice. Here is how it works:

You run forker run go run server.go — a tiny Go HTTP server listening on :8080
The parent creates a child process with all the namespace flags set
That child is not the Go server yet — it is forker itself running again, inside the new namespace
This second forker does all the setup work: sets the hostname, remounts /proc, brings loopback up
Only after setup is done does it exec into go run server.go, which becomes PID 1 inside the sandbox and binds to :8080

In forker the binary just checks one environment variable at startup to know which role it is playing.

main.go

func main() {
    if runtime.IsChildProcess() {
        runtime.ChildMain()
        return
    }
    runtime.Run(os.Args)
}

The parent sets __FORKER_CHILD__=1 before launching the child. The child wakes up, sees that variable, and knows it is the one running inside the sandbox. This is the same trick runc uses, and runc is what Docker uses under the hood.

Why this is the key idea behind Kubernetes replicas

Here is what surprised me once I had it working. You can run the same binary three times, each one thinking it is PID 1, each one binding to port 8080, and none of them can see each other — all on the same physical machine.

Interactive: same server.go, three sandboxes, one host

Press Play to watch the daemon spawn three isolated replicas, or step through manually.

That is exactly what a Kubernetes replica set is. When you set replicas: 3, the scheduler is placing three isolated copies of the same process on nodes, each with their own network namespace, each oblivious to the others. The orchestration layer handles routing traffic between them. The isolation is just Linux namespaces underneath.

Setup inside the sandbox

Once the child process is running inside its namespaces, three things need to happen before the actual program starts.

Hostname. One syscall to give the sandbox its own name.

syscall.Sethostname([]byte(cfg.Hostname))

Filesystem. Mount namespaces are not fully isolated by default, so mounts can leak back to the host unless you opt out explicitly.

syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, "")
syscall.Mount("proc", "/proc", "proc", syscall.MS_NOSUID|syscall.MS_NODEV|syscall.MS_NOEXEC, "")

Without remounting /proc, the child can see every process on the host machine.

Network. Inside a new network namespace, loopback is dead by default. Even 127.0.0.1 does not work until you bring the interface up.

exec.Command("ip", "link", "set", "lo", "up").Run()

I found this out because localhost was broken inside the sandbox and I had no idea why.

What this is not

forker is not fully isolated and I want to be upfront about that.

From inside the sandbox you can still access the host filesystem because I never changed the root with chroot. There are also no resource limits — no cgroups. One bad process inside can eat all the memory and CPU on the host and nothing will stop it.

Docker handles both of these. It uses a proper root filesystem to lock the container into its own directory tree and uses cgroups to cap memory and CPU per container. I could add these, and if I did I would probably also want to think about volume mounts and container persistence. That starts to become a real project on its own.

But the goal is not to build a complete container runtime. The goal is to understand enough to eventually build a Kubernetes API server and scheduler, to get to a point where nothing in these tools feels like a black box. Even understanding 10% of what they do changes how you think. Everything stops being magic and starts being just Linux.

What is next

Right now forker can run an isolated process with a separate hostname, a separate PID space, and a separate filesystem view.

But there is one big thing missing. If I run a server inside the namespace I cannot reach it from the host machine — no curl, no browser, nothing. The network namespace is isolated but there is no bridge connecting it back out.

The next step is networking: a virtual network interface inside the sandbox, connected to the host through a bridge, with port forwarding working. Only then does this start to feel like something real.

That is week 3.