IPv6 on production Docker

At Angry Bytes, we deploy many of our apps to Docker. The setup is nothing fancy: plain Debian, managed by a bunch of custom scripts. This has worked great for us, both for the tools we use ourselves and for the ones our customers use.

I've recently done some upgrades in the setup, and finally enabled IPv6 for these apps.

The challenge in dual stack

Docker has a guide on IPv6. The short version of this document is: “Here's how to assign global addresses to your containers, you do the rest.” This is very much in contrast to the plug-and-play NAT routing Docker does for you in the IPv4 world.

There's a difficulty in setting up both IPv4 and IPv6 in that you often have to create two very different views of your infrastructure. On the one side, services are available with a direct IPv6 address, and on the other side behind an IPv4 NAT router. And this goes through all layers of configuration: routers, DNS, the hosts themselves, etc.

The current situation in Docker amplifies this, because every host is a router for its containers, and with IPv6, this is now in your hands.

Bonus gripe: Docker assigns IPv6 addresses sequentially, and throws privacy extensions out the window.

Unpopular: IPv6 NAT

I picked a perhaps unpopular direction to try to simplify things: to abandon the plug-and-play Docker routing for IPv4, and to abandon global IPv6 addresses for containers, putting both IPv4 and IPv6 behind manual NAT instead. This gives us uniform routing across both stacks.

Being able to globally address services is a very good thing, but arguably, containers are just segments of a host. At least for us, there's not much benefit to giving containers a global address.

On top of this, IaaS providers can be troublesome. At the time of writing:

  • Amazon Web Services does not give you enough control over routing to delegate an entire subnet to your Docker machine.

  • Google Cloud only supports IPv6 at the load balancer. Individual instances have no IPv6 connectivity.

  • Linode and Scaleway assign you just a single IPv6 address.

But hopefully that situation will improve in the near future.

Bonus challenge: nftables

In the remainder of this article, I'll be talking about nftables. This is the new tool in Linux land set to replace iptables. The nftables wiki is a great resource to start learning about this, but hopefully it's not too difficult to follow along with a little bit of iptables experience.

We're running release candidates of Debian 9 (stretch) to get the up-to-date kernel and tool versions, but I believe backports on Debian 8 (jessie) also suffice.

To disable the plug-and-play routing in Docker, we need to pass an extra flag to dockerd from the systemd unit. Create an override file /etc/systemd/system/docker.service.d/10-args.conf with:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --iptables=false

But there's a caveat: Docker will still try to use iptables for its embedded DNS server. Within each container, Docker creates DNAT rules for DNS on the special loopback address 127.0.0.11, and configures that address as the nameserver in /etc/resolv.conf. (You won't normally see these rules on the host, because they live in a different network namespace.)

We need to keep these rules working, and they need to be in nftables, because NAT can't be mixed between nftables and iptables.

First, blacklist the iptables kernel modules to prevent them from interfering with nftables NAT. Create /etc/modprobe.d/noiptables.conf with:

install ip_tables /bin/false
install ip6_tables /bin/false
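
To confirm the blacklist took effect, one option is to check whether the modules are currently loaded. Note that the install lines only stop future loads, so modules that were already loaded stay until they are removed or the machine reboots. The sketch below reads /proc/modules; module_loaded is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named kernel module is listed in
# /proc/modules (or in another file of the same format, for testing).
module_loaded() {
  name=$1
  modules_file=${2:-/proc/modules}
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Report the state of both modules named in the modprobe configuration.
for m in ip_tables ip6_tables; do
  if module_loaded "$m"; then
    echo "$m is still loaded"
  else
    echo "$m is not loaded"
  fi
done
```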

The nftables project has created a compatibility layer that partially translates iptables syntax to nftables, which you can install with:

apt-get install iptables-compat

We can't outright uninstall iptables, because it's a Docker dependency. But we can trick Docker into preferring iptables-compat by creating a symlink:

ln -s /sbin/iptables-compat /usr/local/sbin/iptables
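
The symlink only helps if /usr/local/sbin comes before /sbin in the PATH that dockerd runs with, which is assumed here to be the case on a default Debian install. A rough way to see which binary wins for a given PATH (first_match is a hypothetical helper, not part of any package):

```shell
#!/bin/sh
# Hypothetical helper: print the first executable called $1 found in the
# colon-separated directory list $2, similar to a PATH lookup.
first_match() {
  name=$1
  old_ifs=$IFS
  IFS=:
  for dir in $2; do
    if [ -x "$dir/$name" ]; then
      IFS=$old_ifs
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# With the symlink in place, this should print /usr/local/sbin/iptables.
first_match iptables "$PATH" || echo "no iptables found in PATH"
```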

Fixed subnets

Docker automatically determines subnets for its networks for IPv4, but requires you to do this manually for IPv6. Since we're going to create our own rules, we actually want manual assignment everywhere.

Extend /etc/systemd/system/docker.service.d/10-args.conf as follows:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --iptables=false --ipv6 --fixed-cidr=172.17.0.0/16 --fixed-cidr-v6=fcdd::/48

The new settings, starting at --ipv6, configure the default bridge network in Docker.

The subnet 172.17.0.0/16 is usually already the subnet selected for this network, and we simply formalize that here so we can depend on it in nftables.
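
For completeness, a sketch of generating this drop-in and applying it. The empty ExecStart= line clears the command from the packaged unit before the new one is set; without it, systemd rejects a second ExecStart for this kind of service. The function name and its argument handling are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: print a systemd drop-in that replaces dockerd's
# command line. All arguments are appended after "-H fd://".
docker_dropin() {
  printf '[Service]\n'
  printf 'ExecStart=\n'
  printf 'ExecStart=/usr/bin/dockerd -H fd:// %s\n' "$*"
}

docker_dropin --iptables=false --ipv6 \
  --fixed-cidr=172.17.0.0/16 --fixed-cidr-v6=fcdd::/48

# To apply (as root):
#   docker_dropin ... > /etc/systemd/system/docker.service.d/10-args.conf
#   systemctl daemon-reload
#   systemctl restart docker
```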

The fc00::/7 block in IPv6 is reserved for unique local addresses, similar to e.g. 172.16.0.0/12 in IPv4, so how you assign this is (also) up to you. (Strictly speaking, only the fd00::/8 half is defined for local assignment; fc00::/8 is left for future definition.) I've picked the fcdd::/16 prefix for everything Docker, and assign smaller subnets to individual networks. Here, fcdd::/48 is the default network.
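
If you'd rather follow RFC 4193 more closely, it describes building a local /48 from fd plus a random 40-bit global ID. The sketch below is one way to generate such a prefix. It simply draws random bytes rather than using the hash-based procedure the RFC suggests, and the function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: turn 10 hex digits (a 40-bit global ID) into a
# unique local /48 prefix under fd00::/8.
ula_prefix() {
  id=$1
  a=$(printf '%s' "$id" | cut -c1-2)
  b=$(printf '%s' "$id" | cut -c3-6)
  c=$(printf '%s' "$id" | cut -c7-10)
  printf 'fd%s:%s:%s::/48\n' "$a" "$b" "$c"
}

# Read 5 random bytes and print them as 10 lowercase hex digits.
random_global_id() {
  od -An -N5 -tx1 /dev/urandom | tr -d ' \t\n'
}

# Print a freshly generated prefix; the value differs on every run.
ula_prefix "$(random_global_id)"
```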

When creating more networks, you use similar, but unfortunately slightly different options. As an example:

docker network create \
  --ipv6 \
  --subnet=172.18.0.0/16 \
  --subnet=fcdd:1::/48 \
  front
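
The numbering in this example can be generated mechanically. The pairing used below, where network n gets 172.(17+n).0.0/16 and fcdd:n::/48, is only the convention implied by the example, and network_subnets is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the IPv4 and IPv6 subnets for Docker network
# number n, where 0 is the default bridge. n is limited to 0..14 so the
# IPv4 subnet stays inside 172.16.0.0/12. For n = 0 the IPv6 line reads
# fcdd:0::/48, which is the same prefix as fcdd::/48.
network_subnets() {
  n=$1
  if [ "$n" -lt 0 ] || [ "$n" -gt 14 ]; then
    echo "network number must be between 0 and 14" >&2
    return 1
  fi
  printf '172.%d.0.0/16\n' "$((17 + n))"
  printf 'fcdd:%x::/48\n' "$n"
}

network_subnets 1

# Possible use with docker network create:
#   set -- $(network_subnets 1)
#   docker network create --ipv6 --subnet="$1" --subnet="$2" front
```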

But I'll assume there's just the default network from here on.

Docker will automatically enable packet forwarding in Linux. If you are using router advertisements to configure IPv6 on your host, you may notice that the default route has suddenly disappeared. This is because Linux will by default not accept router advertisements when operating as a router itself. You can force it to accept them with:

sysctl net.ipv6.conf.eth0.accept_ra=2
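
A value set with sysctl on the command line is lost on reboot. One way to keep it is a file under /etc/sysctl.d. The sketch below just prints the line for a given interface; the file name in the usage comment is an arbitrary choice, and it assumes the interface already exists when the settings are applied at boot.

```shell
#!/bin/sh
# Hypothetical helper: print a sysctl.d line that makes the given
# interface accept router advertisements even with forwarding enabled.
accept_ra_line() {
  printf 'net.ipv6.conf.%s.accept_ra = 2\n' "$1"
}

accept_ra_line eth0

# To persist and apply (as root):
#   accept_ra_line eth0 > /etc/sysctl.d/50-accept-ra.conf
#   sysctl --system
```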

Routing rules

The Debian nftables package ships with an example /etc/nftables.conf, as well as examples in /usr/share/doc/nftables/examples/.

We simply expand on the default configuration to set up roughly what you'd expect your standard home NAT router to do. A basic /etc/nftables.conf might look like this:

#!/usr/sbin/nft -f

define docker_v4 = 172.17.0.0/16
define docker_v6 = fcdd::/48

# start with a clean slate
flush ruleset

table inet filter {
  chain input {
    # default input policy is drop
    type filter hook input priority 50; policy drop;

    # accept any localhost traffic
    iif "lo" accept

    # accept any docker traffic
    ip saddr $docker_v4 accept
    ip6 saddr $docker_v6 accept

    # accept any icmp traffic
    ip protocol icmp accept
    ip6 nexthdr ipv6-icmp accept

    # accept any established connection traffic
    ct state established,related accept
  }

  chain forward {
    # default forward policy is drop
    type filter hook forward priority 50; policy drop;

    # accept any docker traffic going to the internet
    ip saddr $docker_v4 oif eth0 accept
    ip6 saddr $docker_v6 oif eth0 accept

    # accept any established connection traffic
    ct state established,related accept
  }

  chain output {
    # default output policy is accept
    type filter hook output priority 50; policy accept;
  }
}

table ip nat {
  chain prerouting {
    type nat hook prerouting priority 0;
  }

  chain postrouting {
    type nat hook postrouting priority 100;

    # apply source nat for docker traffic to the internet
    ip saddr $docker_v4 oif eth0 masquerade
  }
}

table ip6 nat {
  chain prerouting {
    type nat hook prerouting priority 0;
  }

  chain postrouting {
    type nat hook postrouting priority 100;

    # apply source nat for docker traffic to the internet
    ip6 saddr $docker_v6 oif eth0 masquerade
  }
}

Important notes about this:

  • How exactly you match traffic is a matter of personal preference. For example, here I match on the specific subnets, not the entire fcdd::/16 prefix. Any new networks you create in Docker would require new rules.

  • I've set up these rules to work regardless of the address assigned to the network interface (masquerade instead of snat), because many cloud providers use some form of dynamic configuration.

  • The prerouting and postrouting chains need to exist for NAT to work, even if they are otherwise empty.

Following this, you can load the ruleset (and reload it at any time) with:

nft -f /etc/nftables.conf
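
When editing the file, it can be handy to validate it before loading. nft has a -c (check) option that parses the input without applying it. The wrapper below is a sketch of checking first and loading second; reload_ruleset and the NFT variable are names invented here.

```shell
#!/bin/sh
# NFT can be overridden, e.g. to point at a wrapper; defaults to nft.
NFT=${NFT:-nft}

# Hypothetical helper: check a ruleset file, then load it only if the
# check passed.
reload_ruleset() {
  file=${1:-/etc/nftables.conf}
  if "$NFT" -c -f "$file"; then
    "$NFT" -f "$file"
  else
    echo "check failed, ruleset not loaded: $file" >&2
    return 1
  fi
}

# Usage (as root):
#   reload_ruleset /etc/nftables.conf
```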

A systemd unit is available to load the ruleset across restarts:

systemctl enable nftables

Exposing ports

For the most part, having Docker assign the IP addresses to your containers is just fine. You can still link things together using the embedded Docker DNS server.

However, containers that need to have ports exposed on the host now require some manual steps.

First, manually assign an IP address to the container. The important options in the example below are --ip and --ip6:

docker run -d --restart always --name nginx \
  --ip 172.17.50.1 \
  --ip6 fcdd::50:1 \
  -v /srv/nginx/etc:/etc/nginx \
  nginx:alpine

I've chosen the 50 prefix for both IPv4 and IPv6 here, because it's high enough to not normally conflict with Docker's own numbering of other containers, and is fairly easy to remember and recognize.

Next, create rules to NAT the required ports to the container:

define nginx_v4 = 172.17.50.1
define nginx_v6 = fcdd::50:1

# apply destination nat for nginx traffic
add rule ip nat prerouting iif eth0 tcp dport { 80, 443 } dnat $nginx_v4
add rule ip6 nat prerouting iif eth0 tcp dport { 80, 443 } dnat $nginx_v6

# accept nginx traffic
add rule inet filter forward ip daddr $nginx_v4 tcp dport { 80, 443 } accept
add rule inet filter forward ip6 daddr $nginx_v6 tcp dport { 80, 443 } accept

Of note here is that we match on interface eth0 instead of our address, which is not as neat as I'd personally like it to be, but works for a dynamic address. (Notably, the host and its other containers won't be able to reach nginx through the port forwards.)
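
Since every exposed container needs the same four rules, they can be generated. The sketch below prints them for a given pair of addresses and a port list, for saving to a file that is then loaded with nft -f. port_forward_rules is a hypothetical helper, and it hard-codes eth0 just as the rules above do.

```shell
#!/bin/sh
# Hypothetical helper: print DNAT and forward rules for one container.
#   $1 = container IPv4 address
#   $2 = container IPv6 address
#   $3 = comma-separated list of TCP ports
port_forward_rules() {
  v4=$1
  v6=$2
  ports=$3
  cat <<EOF
add rule ip nat prerouting iif eth0 tcp dport { $ports } dnat $v4
add rule ip6 nat prerouting iif eth0 tcp dport { $ports } dnat $v6
add rule inet filter forward ip daddr $v4 tcp dport { $ports } accept
add rule inet filter forward ip6 daddr $v6 tcp dport { $ports } accept
EOF
}

# Rules for the nginx container started earlier.
port_forward_rules 172.17.50.1 fcdd::50:1 "80, 443"
```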

Closing thoughts

The entire setup is driven by the principle that I want us to have IPv6-enabled services by default. This is increasingly becoming a requirement, for example to get apps accepted in Apple's App Store.

It's also a bunch of extra work to accomplish something Docker already provides out of the box for IPv4, which seems fine for most people. Hopefully, in the future, we'll have more out-of-the-box solutions available for IPv6, both from Docker and IaaS providers.

Nevertheless, some understanding of what happens under the hood of container networking, and new tools like nftables, is always nice to have. Perhaps this will help us to more easily investigate new alternatives to Docker as they gain traction.