Kubernetes is a complex piece of technology that abstracts away many system administration tasks, but it also solves and automates some processes that are useful at a smaller scale, like blue-green deployments. Having administered managed Kubernetes for a while now, I wanted to find out what a self-managed, small-but-multi-node Kubernetes install looks like.
Most of the non-Kubernetes machines I manage are individual machines, or single database + multiple workers. For this step I'm not really interested in much more than that, like making everything redundant, self-healing, etc. I just want to introduce Kubernetes in something that matches my existing setups.
Getting things fully functional was a long process of trial-and-error, during which I learned about even more things I didn't want to touch:
Public-Key Infrastructure (PKI). Kubernetes definitely leans into this and prefers you manage keys and certificates for all of its components, but I feel like this is a whole separate article in itself.
The NixOS Kubernetes modules. These have their own opinions, and there's nothing wrong with their implementation, but using them goes against some of the learning and experimenting I wanted to do here.
K3s, K0s or any other Kubernetes 'distribution'. These are an extra layer to learn, and an extra layer to trust. They sometimes offer valuable extra functionality; for example, I wish the SQLite backend was in upstream Kubernetes. But again, I avoided these in the interest of learning.
NixOS in general is great, and I'm a big fan, but something Kubernetes can potentially do well (in terms of configuration) is provide a clear boundary between the system and the application. In NixOS, configuring an app is often interwoven with system config, and there are few options to keep the two apart.
Still, I'll be using the Kubernetes package (not module!) from Nixpkgs, as well as building everything on top of NixOS and its excellent systemd parts.
A fully functioning QEMU setup for the end result can be found at: https://codeberg.org/kosinus/nixos-kubernetes-experiment
Basic NixOS configuration
At the time of writing, NixOS 25.11 is mere weeks away, so that is my target.
There's a bunch of stuff I enable on all of my NixOS machines that is relevant to the rest of this article.
I prefer nftables over iptables, because it's the future. In practice, the
iptables command is already a compatibility layer in many Linux distributions, but these options additionally enable the nftables-based firewall in NixOS:
{
networking.nftables.enable = true;
# We want to filter forwarded traffic.
# Also needed for `networking.firewall.extraForwardRules` to do anything.
networking.firewall.filterForward = true;
}
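If you want to confirm the nftables-based firewall is in effect after a rebuild, dumping the ruleset is the quickest check (the exact table and chain names are whatever the NixOS module generates):
# Show all tables and chains currently loaded, including the NixOS firewall
# and, later on, anything kube-proxy adds.
nft list ruleset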
I enable systemd-networkd, because it's the future. I wouldn't even know how to set up all the networking parts in other setups; systemd-networkd is just really nice when you have a bunch of moving parts in your networking.
{
networking.useNetworkd = true;
}
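With systemd-networkd in charge, networkctl is the go-to inspection tool; it becomes genuinely useful once the bridge and Wireguard interfaces from the sections below exist:
# List all links and whether a .network unit has configured them.
networkctl list
# Per-interface details: addresses, routes, and the unit that configured it.
networkctl status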
Kubernetes version
The current version of Kubernetes at the time of writing is 1.34. It's useful to check the package version, because Kubernetes requires step-by-step minor version upgrades:
{ lib, pkgs, ... }:
{
# Ensure we carefully upgrade Kubernetes versions.
# We need to step 1 minor version at a time.
assertions = [
{
assertion = lib.hasPrefix "1.34." pkgs.kubernetes.version;
message = "Unexpected Kubernetes package version: ${pkgs.kubernetes.version}";
}
];
}
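You can also check what you're about to get before switching. A small sketch, assuming a flake with a nixosConfigurations.node1 output; adjust the attribute path to your own layout:
# Prints the pinned version, e.g. 1.34.1.
nix eval --raw .#nixosConfigurations.node1.pkgs.kubernetes.version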
Networking
If you've ever used Docker or Podman, your typical networking setup looks like this:
The machine is logically split into host and container network namespaces. Each container is assigned half of a veth pair, and the other half is attached to a bridge interface on the host. The host assigns a subnet to the bridge with an address for itself, like
172.16.0.1/24, and an address for each container. The host is then the gateway for containers, performing layer 3 routing and NAT on outgoing traffic to the internet.
Kubernetes wants you to connect these container subnets across multiple machines. In this article I assume there is a private network connecting all nodes together:
In addition to the 'outward' link from the host to the internet, the host now has a second link to a network switch that brings the hosts together in a private network. We intend to route traffic between container subnets across this private network somehow. Notably, NAT is still only performed on traffic to the internet, not on traffic between containers.
Even if you have a private network like this, you may not be able to simply route traffic from container subnets across it. Cloud providers often restrict the addresses a machine can use on its network interface to what is preconfigured in the cloud resources.
There are a lot of ways to actually connect the subnets together, but I chose Wireguard because I know it, and because I wanted to test drive the overhead of encrypted links with real applications. It's potentially an additional layer of security if you're running this on the network of a cloud provider that otherwise doesn't encrypt customer traffic on internal networks. (But some may call you paranoid.)
Some alternatives here:
- Use some other tunneling protocol like GENEVE or VXLAN. Maybe GRE works too?
- Instead use TLS at the application layer for securing connections, e.g. HTTPS between proxy and backend, TLS to your database, etc.
- If you control the physical network (or even just layer 2), you can actually connect containers directly to the network using macvlan and even have your existing DHCP server assign addresses.
- Something like flannel can help you make the whole setup dynamic, if your machines tend to come and go.
Container subnets
First, let's determine our addressing scheme for all of our containers across machines.
{ config, lib, ... }:
{
# I like to create NixOS options for variables that are going to be used
# across multiple files, so I can reach them (without imports) via the
# `config` parameter of a NixOS module.
options.kube = {
# We're going to assign each node a one-based index, and derive the
# container subnet from that.
nodeIndex = lib.mkOption { type = lib.types.ints.positive; };
# Having a zero-based index on hand will become useful later.
nodeIndex0 = lib.mkOption {
type = lib.types.ints.unsigned;
default = config.kube.nodeIndex - 1;
};
# Functions that take a node index and build a subnet in CIDR-notation.
mkNodeCidr6 = lib.mkOption {
type = with lib.types; functionTo str;
default = index: "fd88:${toString index}::/32";
};
mkNodeCidr4 = lib.mkOption {
type = with lib.types; functionTo str;
default = index: "10.88.${toString index}.0/24";
};
# On each node, the host will take the first IP in the subnet.
# Containers will use this IP as the gateway.
mkHostIp6 = lib.mkOption {
type = with lib.types; functionTo str;
default = index: "fd88:${toString index}::1";
};
mkHostIp4 = lib.mkOption {
type = with lib.types; functionTo str;
default = index: "10.88.${toString index}.1";
};
# For each of the above functions, define the values for the local node.
nodeCidr6 = lib.mkOption {
type = lib.types.str;
default = config.kube.mkNodeCidr6 config.kube.nodeIndex;
};
nodeCidr4 = lib.mkOption {
type = lib.types.str;
default = config.kube.mkNodeCidr4 config.kube.nodeIndex;
};
hostIp6 = lib.mkOption {
type = lib.types.str;
default = config.kube.mkHostIp6 config.kube.nodeIndex;
};
hostIp4 = lib.mkOption {
type = lib.types.str;
default = config.kube.mkHostIp4 config.kube.nodeIndex;
};
# The zero subnet is for Kubernetes Cluster IPs used in Service resources.
# NOTE: Would love to use IPv6 here, but that is trouble for many apps.
servicesCidr = lib.mkOption {
type = lib.types.str;
default = "10.88.0.0/24";
};
};
}
Now each machine needs to assign the node index in per-machine configuration:
{
kube.nodeIndex = 1;
}
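As a sanity check, the derived values can be evaluated without deploying anything (again assuming a flake exposing nixosConfigurations.node1):
nix eval --raw .#nixosConfigurations.node1.config.kube.nodeCidr4   # 10.88.1.0/24
nix eval --raw .#nixosConfigurations.node1.config.kube.nodeCidr6   # fd88:1::/32
nix eval --raw .#nixosConfigurations.node1.config.kube.hostIp4     # 10.88.1.1
nix eval --raw .#nixosConfigurations.node1.config.kube.hostIp6     # fd88:1::1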
Now we have everything we need to configure the bridge interface we'll connect containers to. Unlike Docker / Podman, we'll be managing this manually:
{ config, pkgs, ... }:
{
# We need a separate netdev unit to create the bridge interface.
systemd.network.netdevs."10-brkube" = {
netdevConfig = {
Kind = "bridge";
Name = "brkube";
};
};
# Now configure the interface with a network unit.
systemd.network.networks."10-brkube" = {
matchConfig = {
Name = "brkube";
};
networkConfig = {
# We want this interface to always be configured and have addresses.
# Bridges specifically report no-carrier while there are no members.
ConfigureWithoutCarrier = true;
# Disable all link-local addressing. (`169.254.0.0/16` / `fe80::/64`)
LinkLocalAddressing = false;
# Don't allow containers to maliciously become IPv6 routers.
IPv6AcceptRA = false;
};
# Configure the host addresses.
# This also configures the direct routes on the host.
#
# NOTE: Disable DuplicateAddressDetection because otherwise the address
# can remain in a 'tentative' state, and Linux won't allow us to use it
# as a source address in other routes. This is important for later.
addresses = [
{
Address = "${config.kube.hostIp6}/32";
DuplicateAddressDetection = "none";
}
{
Address = "${config.kube.hostIp4}/24";
DuplicateAddressDetection = "none";
}
];
};
# To inspect the bridge interface at runtime using the `brctl` tool.
environment.systemPackages = [ pkgs.bridge-utils ];
}
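Once deployed, the bridge should be configured immediately (despite reporting no-carrier) and carry both host addresses; a quick way to check:
# Interface state as seen by networkd.
networkctl status brkube
# Both host addresses from the container subnets should be assigned.
ip addr show dev brkube
# No members yet; container veth halves will appear here later.
brctl show brkube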
Next we can set up the Wireguard links. For this we need to generate keypairs, and it is at this point that we introduce secrets into the NixOS config. I like to use agenix for this, but there are other choices here, like sops-nix. With agenix, machines decrypt files using their OpenSSH host key.
For simplicity, I'm going to put all keys in a keys/ directory, and add a master key so we can always edit all files locally:
mkdir keys
cd keys/
# Handle this private key file with care!
# The public key is printed on success.
age-keygen -o master_key
Now create a keys/secrets.nix configuration file for agenix:
let
# Public key printed by age-keygen above.
# The master key should be included in every set of publicKeys.
master = "age...";
# OpenSSH host keys of our nodes.
node1 = "ssh-ed25519 AAA...";
node2 = "ssh-ed25519 AAA...";
in
{
# Set recipients of Wireguard private keys to their respective nodes.
"wgkube1.key.age".publicKeys = [ master node1 ];
"wgkube2.key.age".publicKeys = [ master node2 ];
}
Then generate the Wireguard keys and immediately encrypt them:
wg genkey | agenix -i master_key -e wgkube1.key.age
wg genkey | agenix -i master_key -e wgkube2.key.age
Now we can decrypt these files in NixOS configuration:
{ config, ... }:
{
# This will make the private key available in `/run/agenix/` as `wgkube.key`.
age.secrets."wgkube.key" = {
file = ./keys + "/wgkube${toString config.kube.nodeIndex}.key.age";
# Make sure systemd-networkd can read this file.
group = "systemd-network";
mode = "0440";
};
}
Next I like to use a peers.json as input to generate the Wireguard configuration. That JSON looks like this:
[
{
"PublicKey": "pHEYIfgWiJEgnR8zKYGnWlbZbQZ0xb5eEyzVSpzz3BM=",
"PeerIP": "192.168.0.1"
},
{
"PublicKey": "TPB2lwnWPjjAZ1Pnn5A6sdhGAePztE5VlbQ/RmU89w4=",
"PeerIP": "192.168.0.2"
}
]
This array is ordered by node index. You can get the public keys as follows:
agenix -i master_key -d wgkube1.key.age | wg pubkey
agenix -i master_key -d wgkube2.key.age | wg pubkey
The PeerIP fields are local network IPs in this example. These could be IPs on the private network provided by your cloud provider, but because this is Wireguard, you can also safely cross the internet. (Though the internet is not necessarily always fast, reliable and within your control.)
I use a JSON file like this because I actually generate it using tofu, but to keep things focused, the tofu configuration will not be in scope of this article. There is a neat little Wireguard provider for it, though.
Now we can configure the links in NixOS:
{
config,
lib,
pkgs,
...
}:
let
# Grab helpers and variables.
# NOTE: Some of these are defined below.
inherit (config.kube)
mkNodeCidr6
mkNodeCidr4
nodeIndex0
wgPort
peers
;
in
{
options.kube = {
# Define the Wireguard port.
# This variable is useful later in firewall config.
wgPort = lib.mkOption {
type = lib.types.port;
default = 51820;
};
# Parse the `peers.json` file.
peers = lib.mkOption {
type = with lib.types; listOf attrs;
default = builtins.fromJSON (builtins.readFile ./keys/peers.json);
};
};
config = {
# We need a separate netdev unit to create the Wireguard interface.
systemd.network.netdevs."11-wgkube" = {
netdevConfig = {
Kind = "wireguard";
Name = "wgkube";
};
wireguardConfig = {
PrivateKeyFile = config.age.secrets."wgkube.key".path;
ListenPort = wgPort;
};
# Generate Wireguard peers from the JSON input.
wireguardPeers = lib.pipe peers [
(lib.imap1 (
index: entry: {
PublicKey = entry.PublicKey;
Endpoint = "${entry.PeerIP}:${toString wgPort}";
# This instructs Wireguard what ranges belong to what peers. It'll
# reject incoming traffic from an incorrect subnet, but also direct
# outgoing traffic to the correct peer based on this. Note that
# this doesn't create routes, however; we do that below.
AllowedIPs = [
(mkNodeCidr6 index)
(mkNodeCidr4 index)
];
}
))
# Filter out ourselves based on index.
# There's unfortunately no ifilter1 for one-based indexing.
(lib.ifilter0 (index0: value: index0 != nodeIndex0))
];
};
# Now configure the interface with a network unit.
systemd.network.networks."11-wgkube" = {
matchConfig = {
Name = "wgkube";
};
networkConfig = {
# Set these options for reasons similar to brkube.
ConfigureWithoutCarrier = true;
LinkLocalAddressing = false;
IPv6AcceptRA = false;
};
# Configures routes for the container subnets of peers.
#
# NOTE: We don't need to configure an address on this interface. As
# long as we route traffic destined for other nodes to this interface,
# Wireguard will send it to the correct peer based on AllowedIPs.
#
# For traffic from the host itself (not forwarded for containers), we
# set PreferredSource to the host IP from brkube.
routes = lib.pipe peers [
# NOTE: This results in a list of lists.
(lib.imap1 (
index: entry: [
{
Destination = mkNodeCidr6 index;
PreferredSource = config.kube.hostIp6;
}
{
Destination = mkNodeCidr4 index;
PreferredSource = config.kube.hostIp4;
}
]
))
# Filter out ourselves based on index.
(lib.ifilter0 (index0: value: index0 != nodeIndex0))
# After filtering we can take the flat list of routes.
lib.flatten
];
};
# To inspect the Wireguard interface at runtime using the `wg` tool.
environment.systemPackages = [ pkgs.wireguard-tools ];
};
}
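After deploying this to at least two nodes, the tunnel and the routes it carries can be inspected from the host:
# Peers, endpoints and AllowedIPs, as generated from peers.json.
wg show wgkube
# Routes towards the other nodes' container subnets.
ip -6 route show dev wgkube
ip -4 route show dev wgkube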
Finally, we configure our firewall and NAT rules:
{ config, ... }:
{
boot.kernel.sysctl = {
# Enable forwarding on all interfaces.
"net.ipv4.conf.all.forwarding" = 1;
"net.ipv6.conf.all.forwarding" = 1;
};
networking.firewall.extraInputRules = ''
# Open the Wireguard port.
# You probably have to adjust this for your network situation.
ip saddr 192.168.0.0/24 udp dport ${toString config.kube.wgPort} accept
# Accept connections to Kubernetes Cluster IPs.
# These are virtual IPs that every node makes available locally.
ip daddr ${config.kube.servicesCidr} accept
'';
networking.firewall.extraForwardRules = ''
# Route all container traffic anywhere (internet and internode).
iifname brkube accept
# Route Wireguard traffic destined for local containers.
iifname wgkube ip6 daddr ${config.kube.nodeCidr6} accept
iifname wgkube ip daddr ${config.kube.nodeCidr4} accept
'';
# Apply NAT to traffic from containers to the internet.
# Here we create an `accept` rule to short-circuit traffic that
# _shouldn't_ have NAT, then apply NAT to the rest.
networking.nftables.tables = {
"kube-nat6" = {
family = "ip6";
name = "kube-nat";
content = ''
chain post {
type nat hook postrouting priority srcnat;
iifname brkube ip6 daddr fd88::/16 accept
iifname brkube masquerade
}
'';
};
"kube-nat4" = {
family = "ip";
name = "kube-nat";
content = ''
chain post {
type nat hook postrouting priority srcnat;
iifname brkube ip daddr 10.88.0.0/16 accept
iifname brkube masquerade
}
'';
};
};
}
At this point nodes should be able to ping each other across the tunnel on their private IPs (fd88:*::1), but we won't be able to test the full networking setup until we have some containers running.
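For example, from node 1 with the addressing scheme above, the checks would look like this:
# Reach node 2's bridge addresses across the tunnel.
ping -c 3 fd88:2::1
ping -c 3 10.88.2.1
# Inspect the NAT tables defined above.
nft list table ip6 kube-nat
nft list table ip kube-nat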
Hostnames
Kubernetes needs to be configured with a domain name where it will advertise Services in DNS. Many examples use cluster.local, but I find this a bad idea, because .local is for mDNS. Instead, I'll be using k8s.internal.
Nodes in Kubernetes register themselves with a name, typically whatever hostname is configured in the OS. However, I'm going to decouple this from the OS hostname and instruct Kubernetes to use k8s.internal everywhere, leaving the OS hostname untouched.
{
config,
lib,
pkgs,
...
}:
let
inherit (config.kube)
peers
nodeIndex
mkHostIp6
mkHostIp4
domain
mkNodeHost
;
in
{
options.kube = {
# The internal domain name we use for all Kubernetes purposes.
domain = lib.mkOption {
type = lib.types.str;
default = "k8s.internal";
};
# Function that defines the format for node hostnames.
mkNodeHost = lib.mkOption {
type = with lib.types; functionTo str;
default = index: "node${toString index}.${domain}";
};
# The hostname of the local node.
nodeHost = lib.mkOption {
type = lib.types.str;
default = mkNodeHost nodeIndex;
};
# All static hosts to add to the Kubernetes domain.
# This is in a similar format to `networking.hosts`.
allHosts = lib.mkOption {
type = with lib.types; attrsOf (listOf str);
};
# `allHosts` as a file in `/etc/hosts` format.
allHostsFile = lib.mkOption {
type = lib.types.path;
default = lib.pipe config.kube.allHosts [
(lib.mapAttrsToList (ip: hosts: "${ip} ${lib.concatStringsSep " " hosts}\n"))
lib.concatStrings
(pkgs.writeText "kubernetes-static-hosts.txt")
];
};
};
config = {
# Add all node hosts to the Kubernetes domain.
# The `mkBefore` ensures the node host is the first listed,
# which is what a reverse IP lookup resolves to.
kube.allHosts = lib.pipe peers [
(lib.imap1 (
index: entry: {
${mkHostIp6 index} = lib.mkBefore [ (mkNodeHost index) ];
${mkHostIp4 index} = lib.mkBefore [ (mkNodeHost index) ];
}
))
(lib.mergeAttrsList)
];
# Also add the static hosts to `/etc/hosts`.
networking.hostFiles = [ config.kube.allHostsFile ];
};
}
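Because allHostsFile ends up in /etc/hosts on every node, the node hostnames already resolve before any Kubernetes component is running:
# Lists the addresses from kube.allHosts via /etc/hosts.
getent ahosts node2.k8s.internal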
kube-apiserver
We're going to build a multi-node setup, but keep it close to a traditional setup of 1 database server + multiple workers. In this setup, the database server is the ideal place for any kind of centralized processing, so we'll be running those parts of Kubernetes there as well. Instead of calling it a database server, I'll call it the 'primary' server going forward.
{ config, lib, ... }:
{
options.kube = {
# Define roles for nodes. The first node will be the 'primary' node.
role = lib.mkOption {
type = lib.types.str;
default = if config.kube.nodeIndex == 1 then "primary" else "worker";
};
# The IP of the primary node.
primaryIp = lib.mkOption {
type = lib.types.str;
default = config.kube.mkHostIp6 1;
};
};
}
We'll add some further variables in kube.api to describe the API endpoint:
{ config, lib, ... }:
{
options.kube.api = {
# Kubernetes creates a Service with Cluster IP for its own API.
# This is always the first IP in the services subnet.
serviceIp = lib.mkOption {
type = lib.types.str;
default = "10.88.0.1";
};
# The HTTPS port the API server will listen on.
# This is only important when connecting directly to the primary node.
# When using the Kubernetes Service, it's translated to regular 443.
port = lib.mkOption {
type = lib.types.port;
default = 6443;
};
# Define an internal hostname for the API.
# This is only used when a node host needs to talk to the API.
# Containers instead use the Kubernetes Service to reach the API.
internalHost = lib.mkOption {
type = lib.types.str;
default = "api.${config.kube.domain}";
};
# Build the full internal URL to the API.
internalUrl = lib.mkOption {
type = lib.types.str;
default = "https://${config.kube.api.internalHost}:${toString config.kube.api.port}";
};
# An externally reachable host for the API.
# The API server builds URLs using this hostname, so you'll want to add
# this to DNS. Doesn't have to be fully public, could still be internal to
# your organization.
externalHost = lib.mkOption {
type = lib.types.str;
default = "test-kube.example.com";
};
# Build the full external URL to the API.
# We also use this as the 'audience' of API server JWTs.
externalUrl = lib.mkOption {
type = lib.types.str;
default = "https://${config.kube.api.externalHost}:${toString config.kube.api.port}";
};
};
config = {
# Add the internal API host to the Kubernetes domain.
kube.allHosts.${config.kube.primaryIp} = [ config.kube.api.internalHost ];
};
}
The API server uses etcd for storage by default. We'll be creating a very simple installation here and protecting it using Unix sockets with limited permissions.
In a production setup, you want to make periodic backups of the data in etcd. You can do this using etcdctl snapshot save, or simply back up the database file at
/var/lib/etcd/member/snap/db. (The former method can't be piped into some other command, but the latter method excludes the database WAL file. See etcd disaster recovery.)
{
config,
lib,
pkgs,
...
}:
# Only on the primary node.
lib.mkIf (config.kube.role == "primary") {
# Create a dedicated user and group so we can control access to the socket.
users.groups.etcd = { };
users.users.etcd = {
isSystemUser = true;
group = "etcd";
};
# Configure the systemd service unit.
systemd.services.etcd = {
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "notify";
User = "etcd";
ExecStart =
"${pkgs.etcd}/bin/etcd"
+ " --data-dir /var/lib/etcd"
# Compaction is disabled by default, but that apparently risks the
# database eventually exploding on itself. Weird default.
+ " --auto-compaction-retention=8h"
# Minimum set of options for secure local-only setup without auth.
# Access is limited to users in the 'etcd' group.
+ " --listen-peer-urls unix:/run/etcd/peer"
+ " --listen-client-urls unix:/run/etcd/grpc"
+ " --listen-client-http-urls unix:/run/etcd/http"
# This is required but not actually used in our case.
+ " --advertise-client-urls http://localhost:2379";
Restart = "on-failure";
RestartSec = 10;
# Actual data storage in /var/lib/etcd.
StateDirectory = "etcd";
StateDirectoryMode = "0700";
# Place our Unix sockets in /run/etcd.
RuntimeDirectory = "etcd";
RuntimeDirectoryMode = "0750";
};
postStart = ''
# Need to make sockets group-writable to allow connections.
chmod 0660 /run/etcd/{grpc,http}
'';
};
# For the `etcdctl` tool.
environment.systemPackages = [ pkgs.etcd ];
}
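Once the unit is running, any user in the etcd group can talk to it over the socket. A quick health check, plus the snapshot backup mentioned earlier; this is a sketch that assumes etcdctl accepts the same unix: endpoint notation as the listen URLs:
# Check that etcd answers on the client socket.
sudo -u etcd etcdctl --endpoints=unix:/run/etcd/grpc endpoint health
# Take a snapshot backup to a location of your choosing.
sudo -u etcd etcdctl --endpoints=unix:/run/etcd/grpc snapshot save /tmp/etcd-backup.db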
Now we are almost ready to start the API server! First we need to put some secrets in place for it.
You'll want an EncryptionConfiguration to tell Kubernetes how to encrypt Secret resources on disk. I recommend using a configuration with just
secretbox to start:
# Extend keys/secrets.nix.
"EncryptionConfiguration.yaml.age".publicKeys = [ master node1 ];
# Edit the encrypted file.
agenix -i master_key -e EncryptionConfiguration.yaml.age
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
# Expand this if you have custom resources that store sensitive data.
- resources:
- secrets
providers:
- secretbox:
keys:
- name: key1
# Generate this with: head --bytes=32 /dev/random | base64
secret: "<BASE 64 ENCODED SECRET>"
Next we need credentials for API server authentication. There are a bunch of methods available for this, but we'll be using the 'static token file' method, and handing a CSV file to the API server. A major downside of this is that the API server can't reload this at runtime, so changing any of these (such as when adding nodes) requires an API server restart.
We're going to create a root user in the API with full admin access.
# Extend keys/secrets.nix.
"kube_token_root.age".publicKeys = [ master node1 ];
# Generate and encrypt the token.
pwgen -s 64 | agenix -i master_key -e kube_token_root.age
Nodes also need tokens to register themselves in the API, and I'm going to use a dirty trick here: reuse the Wireguard private keys as tokens. This means the API server has access to all Wireguard private keys, but I figure compromise of the API server means you can execute arbitrary code on any node anyway. If you're more concerned, you could just generate separate tokens instead. In any case, to reuse the Wireguard keys, the primary node needs access:
# Update keys/secrets.nix and ensure node1 is listed for every Wireguard key.
"wgkube1.key.age".publicKeys = [ master node1 ];
"wgkube2.key.age".publicKeys = [ master node1 node2 ];
We also need some tokens for Kubernetes components that run alongside the API server on the primary node. I'm going to use the kube_token_system_ prefix for these, followed by the service name. That naming convention allows us to iterate files later.
# Extend keys/secrets.nix.
"kube_token_system_kube-controller-manager.age".publicKeys = [ master node1 ];
"kube_token_system_kube-scheduler.age".publicKeys = [ master node1 ];
# Generate and encrypt the tokens.
for uid in kube-controller-manager kube-scheduler; do
pwgen -s 64 | agenix -i master_key -e "kube_token_system_${uid}.age"
done
To connect these components to the API server, we provide a tool to help generate a kubeconfig file:
{
config,
lib,
pkgs,
...
}:
{
options.kube = {
# Small utility that helps us build a kubeconfig for our cluster.
# The caller should set $KUBECONFIG to the file to create / modify.
mkkubeconfig = lib.mkOption {
type = lib.types.package;
default = pkgs.writeShellApplication {
name = "mkkubeconfig";
runtimeInputs = [ pkgs.kubectl ];
text = ''
if [[ $# -ne 1 ]]; then
echo >&2 'Usage: mkkubeconfig <token file>'
exit 64
fi
# NOTE: The API server uses self-signed certificates. In this
# testing setup we instead rely on the Wireguard tunnel for security.
kubectl config set-cluster local --server '${config.kube.api.internalUrl}' --insecure-skip-tls-verify=true
kubectl config set users.default.token "$(<"$1")"
kubectl config set-context local --cluster=local --user=default
kubectl config use-context local
'';
};
};
};
}
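For reference, this is roughly how the helper gets invoked by the units below. You can also run it by hand for debugging, assuming you add it to environment.systemPackages yourself (this article doesn't) and the API server from the next section is up:
# Build a kubeconfig at $KUBECONFIG for a given static token.
export KUBECONFIG=/tmp/kubeconfig
mkkubeconfig /run/agenix/kube_token_root
kubectl get namespaces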
We can finally slap together a NixOS module to start the API server. This is probably the most complex piece of Nix machinery in the setup.
{
config,
lib,
pkgs,
...
}:
let
package = lib.getBin pkgs.kubernetes;
apiPortStr = toString config.kube.api.port;
# NOTE: We put secrets in a separate variable here so we can easily gather
# all secrets in `LoadCredential` below. Using `config.age.secrets` would pull
# in secrets from elsewhere too, which is bad.
keysDirListing = builtins.readDir ./keys;
ageSecrets = lib.mergeAttrsList [
# Decrypt EncryptionConfiguration.
{ "EncryptionConfiguration.yaml".file = ./keys/EncryptionConfiguration.yaml.age; }
# Decrypt all API server tokens.
(lib.pipe keysDirListing [
(lib.filterAttrs (name: type: lib.hasPrefix "kube_token_" name))
(lib.mapAttrs' (
name: type: {
name = lib.removeSuffix ".age" name;
value.file = ./keys + "/${name}";
}
))
])
# Decrypt all Wireguard keys we reuse as tokens.
(lib.pipe keysDirListing [
(lib.filterAttrs (name: type: lib.hasPrefix "wgkube" name))
(lib.mapAttrs' (
name: type: {
name = "kube_token_node" + (lib.removePrefix "wgkube" (lib.removeSuffix ".key.age" name));
value.file = ./keys + "/${name}";
}
))
])
];
in
# Only on the primary node.
lib.mkIf (config.kube.role == "primary") {
age.secrets = ageSecrets;
# Create a dedicated user for kube-apiserver, so we can add it to the etcd group.
users.groups.kube-apiserver = { };
users.users.kube-apiserver = {
isSystemUser = true;
group = "kube-apiserver";
extraGroups = [ "etcd" ];
};
# Open the API server port in the firewall.
networking.firewall.extraInputRules = ''
tcp dport ${apiPortStr} accept
'';
systemd.services.kube-apiserver = {
wantedBy = [ "multi-user.target" ];
after = [ "etcd.service" ];
serviceConfig = {
Type = "notify";
ExecStart =
"${package}/bin/kube-apiserver"
# Connect to etcd.
+ " --etcd-servers='unix:/run/etcd/grpc'"
# HTTPS listener config.
# The certificate is generated in `preStart` below.
+ " --secure-port=${apiPortStr}"
+ " --tls-private-key-file='/var/lib/kube-apiserver/apiserver.key'"
+ " --tls-cert-file='/var/lib/kube-apiserver/apiserver.crt'"
# Authentication and authorization config.
# `tokens.csv` is generated in `preStart` below.
+ " --anonymous-auth=false"
+ " --token-auth-file='/var/lib/kube-apiserver/tokens.csv'"
+ " --authorization-mode='RBAC,Node'"
# Virtual IP range used for Service resources.
# These IPs are routed by kube-proxy on each machine, usually via NAT.
+ " --service-cluster-ip-range='${config.kube.servicesCidr}'"
# For the Service of the API server, advertise the node address.
# Because this also uses NAT, it must also be IPv4.
+ " --advertise-address='${config.kube.hostIp4}'"
# The externally reachable hostname for building API URLs.
+ " --external-hostname='${config.kube.api.externalHost}'"
# Configures signing and verification of JWTs used as service account tokens.
+ " --service-account-issuer='${config.kube.api.externalUrl}'"
+ " --api-audiences='api,${config.kube.api.externalUrl}'"
+ " --service-account-key-file='/var/lib/kube-apiserver/issuer.key'"
+ " --service-account-signing-key-file='/var/lib/kube-apiserver/issuer.key'"
# This sets up the encryption of Secret resources:
# https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/
+ " --encryption-provider-config='%d/EncryptionConfiguration.yaml'";
User = "kube-apiserver";
Restart = "on-failure";
RestartSec = 10;
# For generated keys and certificates.
StateDirectory = "kube-apiserver";
# Make secrets available.
LoadCredential = map (name: "${name}:/run/agenix/${name}") (lib.attrNames ageSecrets);
# For the `postStart` script.
PrivateTmp = true;
};
preStart = ''
openssl=${lib.getExe pkgs.openssl}
cd /var/lib/kube-apiserver
# Ensure a tokens file is present, or create an empty one.
[[ -e tokens.csv ]] || touch tokens.csv
chmod 0600 tokens.csv
# Ensure the token for the root user is present.
file="$CREDENTIALS_DIRECTORY/kube_token_root"
if ! grep -q ",root," tokens.csv; then
echo "$(<"$file"),root,root,system:masters" >> tokens.csv
fi
# Ensure tokens for system users are present.
for file in $CREDENTIALS_DIRECTORY/kube_token_system_*; do
filename="$(basename "$file")"
uid="''${filename#kube_token_system_}"
if ! grep -q ",system:$uid," tokens.csv; then
echo "$(<"$file"),system:$uid,system:$uid" >> tokens.csv
fi
done
# Ensure tokens for nodes are present.
for file in $CREDENTIALS_DIRECTORY/kube_token_node*; do
filename="$(basename "$file")"
uid="''${filename#kube_token_}.${config.kube.domain}"
if ! grep -q ",system:node:$uid," tokens.csv; then
echo "$(<"$file"),system:node:$uid,system:node:$uid,system:nodes" >> tokens.csv
fi
done
# Ensure a private key for HTTPS exists.
[[ -e apiserver.key ]] || $openssl ecparam -out apiserver.key -name secp256r1 -genkey
chmod 0600 apiserver.key
# Generate a new self-signed certificate on every startup.
# Assume services are restarted somewhere in this timeframe so that we
# never have an expired certificate.
$openssl req -new -x509 -nodes -days 3650 \
-subj '/CN=${config.kube.api.externalHost}' \
-addext 'subjectAltName=${
lib.concatStringsSep "," [
"DNS:${config.kube.api.externalHost}"
"DNS:${config.kube.api.internalHost}"
"IP:${config.kube.api.serviceIp}"
]
}' \
-key apiserver.key \
-out apiserver.crt
# Ensure a private key exists for issuing service account tokens.
[[ -e issuer.key ]] || $openssl ecparam -out issuer.key -name secp256r1 -genkey
chmod 0600 issuer.key
'';
postStart = ''
# Wait for the API server port to become available.
# The API server doesn't support sd_notify, so we do this instead to
# properly signal any dependent services that the API server is ready.
export KUBECONFIG=/tmp/kubeconfig
${lib.getExe config.kube.mkkubeconfig} "$CREDENTIALS_DIRECTORY/kube_token_root"
tries=60
while ! ${package}/bin/kubectl get namespaces default >& /dev/null; do
if [[ $((--tries)) -eq 0 ]]; then
echo ">> Timeout waiting for the API server to start"
exit 1
fi
sleep 1
done
rm $KUBECONFIG
'';
};
}
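If the unit fails, journalctl is your friend; if it starts, you can probe the API directly over HTTPS with the root token (as root on the primary node; -k because the certificate is self-signed):
journalctl -u kube-apiserver.service -e
curl -k -H "Authorization: Bearer $(cat /run/agenix/kube_token_root)" \
  'https://api.k8s.internal:6443/readyz?verbose'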
We set up a kubeconfig for root on the primary node that uses the root API user. This allows using kubectl from the shell for easy administration:
{
config,
lib,
pkgs,
...
}:
# Only on the primary node.
lib.mkIf (config.kube.role == "primary") {
# Generate a kubeconfig for root, so that `kubectl` simply works.
system.activationScripts.kubeconfig-root = ''
HOME=/root ${lib.getExe config.kube.mkkubeconfig} "/run/agenix/kube_token_root"
'';
environment.systemPackages = [ pkgs.kubectl ];
}
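With that in place, a root shell on the primary node can confirm both the credentials and API health:
# Shows the user and groups the static root token maps to.
kubectl auth whoami
# Readiness of the API server's internal components.
kubectl get --raw '/readyz?verbose'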
And we also make node credentials available on each node, which will be used by services later:
{ lib, config, ... }:
{
# Creates /run/kubeconfig-node containing the node credentials.
# This is used by per-node services like kubelet, kube-proxy, coredns, etc.
systemd.services.generate-kubeconfig-node = {
wantedBy = [ "multi-user.target" ];
environment.KUBECONFIG = "/run/kubeconfig-node";
serviceConfig = {
Type = "oneshot";
ExecStart = "${lib.getExe config.kube.mkkubeconfig} /run/agenix/wgkube.key";
};
};
}
Add-ons
It's useful to have a way to load some YAML into the API server on startup. I use the term add-ons because I've seen it used for some now-deprecated Kubernetes functionality that served a similar purpose, though the term has also been overloaded in various ways.
{
config,
lib,
pkgs,
...
}:
let
cfg = config.kube;
in
{
options.kube = {
# Run an activation script once the API is up.
activationScript = lib.mkOption {
type = lib.types.lines;
default = "";
};
# Apply addons once the API is up.
addons = lib.mkOption {
type = lib.types.listOf lib.types.path;
default = [ ];
};
};
config = {
assertions = [
{
assertion = cfg.activationScript != "" -> cfg.role == "primary";
message = "kube.activationScript and kube.addons can only be used on the primary node";
}
];
# NOTE: This is not a postStart on kube-apiserver, because that would cause
# kube-apiserver to restart on changes.
systemd.services.kube-activation = lib.mkIf (cfg.activationScript != "") {
wantedBy = [ "multi-user.target" ];
bindsTo = [ "kube-apiserver.service" ];
after = [ "kube-apiserver.service" ];
path = [ pkgs.kubectl ];
# Connect to the API using the root credentials.
environment.KUBECONFIG = "/root/.kube/config";
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
};
script = cfg.activationScript;
};
# Activation script that processes `kube.addons`.
kube.activationScript = lib.mkIf (cfg.addons != [ ]) ''
for file in ${lib.escapeShellArgs (pkgs.copyPathsToStore cfg.addons)}; do
echo >&2 "# $file"
kubectl apply --server-side --force-conflicts -f "$file"
done
'';
};
}
kube-scheduler
Next we need to run kube-scheduler to actually schedule pods:
{
config,
lib,
pkgs,
...
}:
# Only on the primary node.
lib.mkIf (config.kube.role == "primary") {
systemd.services.kube-scheduler = {
wantedBy = [ "multi-user.target" ];
requires = [ "kube-apiserver.service" ];
after = [ "kube-apiserver.service" ];
serviceConfig = {
ExecStart =
"${pkgs.kubernetes}/bin/kube-scheduler"
# Connect to the API.
+ " --kubeconfig='/tmp/kubeconfig'"
# Disable listener, only useful for metrics.
+ " --secure-port=0";
Restart = "on-failure";
RestartSec = 10;
# Let systemd assign a user for this service.
DynamicUser = true;
# For the below `preStart` that generates kubeconfig.
PrivateTmp = true;
LoadCredential = "kube-token:/run/agenix/kube_token_system_kube-scheduler";
};
preStart = ''
# Generate a kubeconfig for the scheduler. Relies on PrivateTmp.
KUBECONFIG=/tmp/kubeconfig ${lib.getExe config.kube.mkkubeconfig} "$CREDENTIALS_DIRECTORY/kube-token"
'';
};
}
kube-controller-manager
Similarly, we need to run kube-controller-manager, which contains all the standard Kubernetes controllers:
{
config,
lib,
pkgs,
...
}:
# Only on the primary node.
lib.mkIf (config.kube.role == "primary") {
systemd.services.kube-controller-manager = {
wantedBy = [ "multi-user.target" ];
# NOTE: This 'bindsTo' also ensures an up-to-date API certificate is published.
# When separating kube-controller-manager from kube-apiserver, some other mechanism
# is required to distribute certificates.
bindsTo = [ "kube-apiserver.service" ];
after = [ "kube-apiserver.service" ];
serviceConfig = {
ExecStart =
"${pkgs.kubernetes}/bin/kube-controller-manager"
# Connect to the API.
+ " --kubeconfig='/tmp/kubeconfig'"
# Disable listener, only useful for metrics.
+ " --secure-port=0"
# This makes the controller manager automagically create a service
# account for each of its controllers. Neat.
+ " --use-service-account-credentials=true"
# This publishes the correct API certificate in the API itself.
# Pods see this as `/var/run/secrets/kubernetes.io/serviceaccount/ca.crt`.
+ " --root-ca-file='/var/lib/kube-apiserver/apiserver.crt'";
Restart = "on-failure";
RestartSec = 10;
# Let systemd assign a user for this service.
DynamicUser = true;
# For the below `preStart` that generates kubeconfig.
PrivateTmp = true;
LoadCredential = "kube-token:/run/agenix/kube_token_system_kube-controller-manager";
};
preStart = ''
# Generate a kubeconfig for the controller manager. Relies on PrivateTmp.
KUBECONFIG=/tmp/kubeconfig ${lib.getExe config.kube.mkkubeconfig} "$CREDENTIALS_DIRECTORY/kube-token"
'';
};
}
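Both components use leader election, so an easy liveness check is whether their Lease objects in kube-system are being held and renewed:
kubectl -n kube-system get lease kube-scheduler kube-controller-manager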
CoreDNS
We need to provide DNS resolution based on Services in the Kubernetes API.
Many deployments run CoreDNS inside Kubernetes, but there's really no standard for how you implement DNS resolution, and different deployments have different needs. Anything works, as long as you have something that fetches Services from the Kubernetes API and serves DNS records for them.
Here we set up CoreDNS, but not inside Kubernetes; instead it is managed by NixOS. We run an instance on every node for simplicity.
{
config,
lib,
pkgs,
...
}:
{
services.coredns = {
enable = true;
config = ''
. {
bind ${config.kube.hostIp6}
errors
# Resolve Kubernetes hosts.
hosts ${config.kube.allHostsFile} ${config.kube.domain} {
reload 0
fallthrough
}
# Resolve Kubernetes services.
kubernetes ${config.kube.domain} {
kubeconfig {$CREDENTIALS_DIRECTORY}/kubeconfig-node
ttl 30
# NOTE: No fallthrough, to prevent a loop with systemd-resolved.
}
# Forward everything else to systemd-resolved.
forward . 127.0.0.53 {
max_concurrent 1000
}
cache 30
loadbalance
}
'';
};
# Provide kubeconfig-node to CoreDNS.
systemd.services.coredns = {
requires = [ "generate-kubeconfig-node.service" ];
after = [
"generate-kubeconfig-node.service"
"kube-activation.service"
];
serviceConfig.LoadCredential = "kubeconfig-node:/run/kubeconfig-node";
};
# Setup systemd-resolved to forward the Kubernetes domain to CoreDNS.
environment.etc."systemd/dns-delegate.d/kubernetes.dns-delegate".text = ''
[Delegate]
Domains=${config.kube.domain}
DNS=${config.kube.hostIp6}
'';
# Open the DNS port to containers.
networking.firewall.extraInputRules = ''
ip6 saddr ${config.kube.nodeCidr6} udp dport 53 accept
ip6 saddr ${config.kube.nodeCidr6} tcp dport 53 accept
'';
# API resources needed for CoreDNS.
kube.addons = lib.mkIf (config.kube.role == "primary") [
./addons/coredns.yaml
];
# For inspecting DNS servers.
environment.systemPackages = [ pkgs.dig ];
}
The referenced add-on file addons/coredns.yaml creates the permissions needed for CoreDNS to access the Kubernetes API:
# Define the coredns role and bind it to the regular node group,
# so that the same node credentials can be used for CoreDNS.
#
# Based on the roles from the upstream addon:
# https://github.com/kubernetes/kubernetes/blob/v1.34.2/cluster/addons/dns/coredns/coredns.yaml.base
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: system:coredns
namespace: kube-system
rules:
- apiGroups:
- ""
resources:
- endpoints
- services
- pods
- namespaces
verbs:
- list
- watch
- apiGroups:
- discovery.k8s.io
resources:
- endpointslices
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system:coredns
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:coredns
subjects:
- kind: Group
name: system:nodes
apiGroup: rbac.authorization.k8s.io
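With the role bound and CoreDNS running, resolution can be verified from the host. Here fd88:1::1 is node 1's host IP, and kubernetes.default.svc is the Service the API server creates for itself:
# Static hosts served by the hosts plugin.
dig @fd88:1::1 node2.k8s.internal AAAA +short
# The Cluster IP of the API server, served by the kubernetes plugin.
dig @fd88:1::1 kubernetes.default.svc.k8s.internal +short
# The same lookup via systemd-resolved, exercising the delegate config.
resolvectl query kubernetes.default.svc.k8s.internal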
kube-proxy
Kube-proxy is what implements cluster IPs assigned to Service resources in the API. It generates firewall rules to NAT cluster IPs to destination pods. It needs to run on every node.
(NOTE: If you decide not to run kubelet on your control plane / primary node, you still need to run kube-proxy! The API server may sometimes contact Services via their Cluster IP too.)
{
lib,
config,
pkgs,
...
}:
{
systemd.services.kube-proxy = {
wantedBy = [ "multi-user.target" ];
requires = [ "generate-kubeconfig-node.service" ];
after = [
"generate-kubeconfig-node.service"
"kube-activation.service"
];
path = [ pkgs.nftables ];
serviceConfig = {
ExecStart =
"${lib.getBin pkgs.kubernetes}/bin/kube-proxy"
# Connect to the API using node credentials.
+ " --kubeconfig='/run/kubeconfig-node'"
+ " --hostname-override='${config.kube.nodeHost}'"
# Prefer nftables mode.
+ " --proxy-mode=nftables"
# Local traffic can be detected by the bridge interface.
+ " --detect-local-mode=BridgeInterface"
+ " --pod-bridge-interface=brkube"
# Addresses to accept NodePort service ports on.
+ " --nodeport-addresses='${config.kube.hostIp6}/128,${config.kube.hostIp4}/32'"
# Can't seem to disable these listeners, so make sure they only listen on localhost.
+ " --healthz-bind-address=[::1]:10256"
+ " --metrics-bind-address=[::1]:10249";
Restart = "on-failure";
RestartSec = 10;
};
};
# API resources needed for kube-proxy.
kube.addons = lib.mkIf (config.kube.role == "primary") [
./addons/kube-proxy.yaml
];
}
The referenced add-on file addons/kube-proxy.yaml is again necessary to set up permissions in the Kubernetes API:
# Bind the kube-proxy role to the regular node group,
# so that the same node credentials can be used for kube-proxy.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system:kube-proxy
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:node-proxier
subjects:
- kind: Group
name: system:nodes
apiGroup: rbac.authorization.k8s.io
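Once kube-proxy is running in nftables mode, its generated rules live in their own tables, next to the ones we wrote by hand earlier:
# Service DNAT rules maintained by kube-proxy.
nft list table ip kube-proxy
nft list table ip6 kube-proxy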
kubelet
Kubelet is the meat of the setup: it starts containers on a node when a pod is assigned to that node. Here we also do the work to set up the CRI-O container runtime and the CNI configuration that tells it how containers get their network.
You technically only need kubelet on machines that run workloads. We simply start it everywhere, including our primary node, but mark the primary as non-schedulable to demonstrate registerWithTaints.
{
lib,
config,
pkgs,
...
}:
let
yaml = pkgs.formats.yaml { };
kubeletConfig = yaml.generate "kubelet.conf" {
apiVersion = "kubelet.config.k8s.io/v1beta1";
kind = "KubeletConfiguration";
# Allow anonymous access, but bind to the secure Wireguard IP.
# This is further locked down by firewall rules.
address = config.kube.hostIp6;
authentication.anonymous.enabled = true;
authorization.mode = "AlwaysAllow";
# Disable other listeners.
healthzPort = 0;
# Use CRI-O.
containerRuntimeEndpoint = "unix:///var/run/crio/crio.sock";
# Don't complain about swap, but don't account for it either.
failSwapOn = false;
memorySwap.swapBehavior = "LimitedSwap";
# Configure DNS using the local CoreDNS server.
clusterDomain = config.kube.domain;
clusterDNS = [ config.kube.hostIp6 ];
# Prevent scheduling pods on the primary node.
registerWithTaints = lib.optional (config.kube.role == "primary") {
key = "role";
value = config.kube.role;
effect = "NoSchedule";
};
};
in
{
virtualisation.cri-o = {
enable = true;
extraPackages = [ pkgs.nftables ];
settings.crio.runtime.log_to_journald = true;
};
systemd.services.kubelet = {
wantedBy = [ "multi-user.target" ];
requires = [
"generate-kubeconfig-node.service"
"crio.service"
];
after = [
"generate-kubeconfig-node.service"
"crio.service"
"kube-activation.service"
];
path = [ pkgs.util-linux ];
serviceConfig = {
Type = "notify";
ExecStart =
"${lib.getBin pkgs.kubernetes}/bin/kubelet"
# Connect to the API using node credentials.
+ " --kubeconfig='/run/kubeconfig-node'"
# Ensure the Node is registered with the expected hostname.
+ " --hostname-override='${config.kube.nodeHost}'"
# Publish our preferred IPv6 node IP.
+ " --node-ip='${config.kube.hostIp6}'"
# Announce the role of this node as a label.
+ " --node-labels='role=${config.kube.role}'"
# Most other flags are deprecated in favour of a config file.
+ " --config='${kubeletConfig}'";
Restart = "on-failure";
RestartSec = 10;
StateDirectory = "kubelet";
};
};
# cri-o bundles an example config file that NixOS installs by default, but we
# override that here with our own configuration.
environment.etc."cni/net.d/10-crio-bridge.conflist".text = lib.mkForce (
builtins.toJSON {
cniVersion = "1.0.0";
name = "brkube";
plugins = [
{
type = "bridge";
bridge = "brkube";
isGateway = true;
ipam = {
type = "host-local";
ranges = [
[ { subnet = config.kube.nodeCidr6; } ]
[ { subnet = config.kube.nodeCidr4; } ]
];
routes = [
{ dst = "::/0"; }
{ dst = "0.0.0.0/0"; }
];
};
}
];
}
);
# Ensure kube-apiserver can connect to this kubelet.
# This is necessary for `kubectl logs`, `kubectl exec`, etc.
networking.firewall.extraInputRules = ''
ip6 saddr ${config.kube.primaryIp} tcp dport 10250 accept
tcp dport 10250 reject
'';
}
Testing
The setup should now be fully functional! If you log in as root on the primary node, you can use kubectl:
# kubectl get node
NAME STATUS ROLES AGE VERSION
node1.k8s.internal Ready <none> 19s v1.34.1
node2.k8s.internal Ready <none> 12s v1.34.1
With node2 in the listing, we know connectivity works from kubelet to API server. Starting a container with an interactive session also tests the opposite direction. In addition, we can test connectivity from the container to the internet:
# kubectl run --rm -it --image=docker.io/alpine test
/ # wget -O - https://example.com/
Connecting to example.com (23.220.75.245:443)
writing to stdout
<!doctype html><html ...
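From the same test pod we can also check cluster DNS and the Service path: resolv.conf should point at the node-local CoreDNS, and the API server's Service name should resolve to its Cluster IP:
/ # cat /etc/resolv.conf
/ # nslookup kubernetes.default.svc.k8s.internal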
What next?
While this setup has all the essentials for workloads, a bunch of stuff is missing to make it more broadly useful.
An Ingress / Gateway controller helps route traffic to containers. The go-to used to be nginx-ingress, but nginx-ingress is going the way of the dodo. I had some fun hacking on caddy-ingress, but that's still experimental. There's a list of Gateway controllers and a list of Ingress controllers if you want to explore.
A storage provisioner can help with data persistence. The modern solution for this is CSI drivers. There are ready-made drivers for NFS and SMB shares, which are really useful if you're coming from a setup where applications share some NFS directories hosted on the primary node. But storage for databases is ideally block storage, which is a bit more work.
Speaking of databases, the nice thing about this setup is that you can simply run services outside Kubernetes, so you can just start a database using regular NixOS config on the primary node for example. I had some fun writing my own controller that allows managing MySQL databases with custom Kubernetes resources: external-mysql-operator. Again, very experimental.
Takeaways
Would I take this into production? Not anytime soon, because I feel like there are a whole bunch of failure modes I've not yet seen. My testing has been limited to QEMU VMs and some AWS EC2 instances.
Especially on VMs, which are typically quite small compared to dedicated servers, Kubernetes itself uses up a chunk of memory and CPU just sitting there.
With the traction Kubernetes has, it does feel like there must be many small installations out there. And if that's the case, it seems to me that Kubernetes could easily reduce some complexity for that type of installation.
For example, do you really need etcd and API server redundancy? It seems upstream SQLite support in combination with Litestream backups would be far more beneficial for smaller installations, when you're happy to deal with some Kubernetes API downtime during upgrades or incidents.
Another easy win (in my opinion) would be runtime reloading of the token auth file. It would instantly make it a more viable option beyond testing. Though with a bit of extra work it can also be accomplished using the webhook or reverse proxy mechanisms supported by Kubernetes.
Overall, though, it feels like Kubernetes itself is maybe only half the complexity, with the other half going to network configuration.