The unwelcome school of Netfilter connection tracking, and high traffic networking

netfilter lit.

Today we did regular maintenance, and during it, something caught my eye:

The server is running out of conntrack table space.

"Oh! not again, it's always conntrack!"

Ah yes, the pain of the Netfilter connection tracker, back to haunt the newly busy Reimu server.

To start the pain: Docker is what powers our infrastructure. And with that comes what we know as NAT (Network Address Translation), because we don't want host networking for anything except the router (we're running Traefik as our router and load balancer).

We have "a lot" of microservices, ports are precious to us, and with that, nf_conntrack is back to haunt us.

We have Traefik handling the frontend, and we want IP forwarding to correctly forward incoming and outgoing packets. Reimu is quite a busy server, as it's our core server handling a lot of things, including our download service, which produces a lot of I/O and, of course, a lot of addresses to translate.

Every month (and mid-month), we release a new build for gourami (well, only gourami builds are served through the download service at the moment, because we can't yet make it "operators friendly"; we're open if you want to help us with the frontend (or backend), just reach me at [email protected] or on our Telegram).

And this generally saturates the gigabit NIC in Reimu (quite a downgrade from the 10G NIC in Rika, but Rika is dead anyway, and now we have DigitalOcean Spaces S3 storage).

And, oh boy, it generates a lot of packets per second and connections to track, and with that come some humans complaining that their downloads suddenly die for no good reason.

With this noted, I opened a regular maintenance window to watch what happens when it occurs.

After ingesting a bunch of logs, I noticed something:

bruh

With the root cause of the dropped packets found, now to the diagnosis step, starting with playing among us checking our nf_conntrack_max value, which is 262144. No shit, Sherlock.
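For reference, here's a quick way to compare the live entry count against the limit (these are the standard net.netfilter sysctl keys; run as root):

```shell
# How many connections are currently tracked vs. the table limit.
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

# When the table fills up, the kernel logs lines like
# "nf_conntrack: table full, dropping packet" -- check dmesg for them.
dmesg | grep conntrack
```

If count is regularly brushing up against max, you've found your dropped-packet culprit.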

Now we come to calculating how big the table needs to be based on our RAM size. Reimu has 64 GB of RAM, and the formula to calculate the optimal table size, per the Huawei ECS wiki, is:

RAM size (in bytes) / 16384 / 2

With this, we translate that formula to 65786680000/16384/2, which produces 2007650.146484375; finally, we round that up to 2097152 (we cheated a bit, given that the Huawei ECS wiki actually gives the recommended value for a 64 GB RAM ECS directly lmao).
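As a sanity check, the arithmetic above can be sketched in a few lines (the RAM figure is the same one used in this post):

```python
# Conntrack table sizing from RAM, per the formula above:
#   RAM size (in bytes) / 16384 / 2, rounded up to a power of two.
ram_bytes = 65_786_680_000  # RAM size in bytes, as used in the post

raw = ram_bytes / 16384 / 2
print(raw)  # 2007650.146484375

# Round up to the next power of two, the usual convention for conntrack sizes.
conntrack_max = 1
while conntrack_max < raw:
    conntrack_max *= 2
print(conntrack_max)  # 2097152
```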

Then, for the hashsize: as we scale the table up immensely, we need to scale the hash table up too, using this formula: conntrack_max/16, which gives us 131072.

After we got this value for it, then we can do

echo "options nf_conntrack expect_hashsize=131072 hashsize=131072" >/etc/modprobe.d/firewalld-sysctls.conf

then run systemctl restart firewalld so the module is reloaded with the new options.
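Note that the modprobe options above only cover the hash table; the table limit itself lives in sysctl. A sketch of persisting it (the file name under /etc/sysctl.d/ is an arbitrary choice, any name works):

```shell
# Persist the new table limit (the hashsize is handled by the modprobe options above).
echo "net.netfilter.nf_conntrack_max = 2097152" > /etc/sysctl.d/90-conntrack.conf

# Apply it now, without waiting for a reboot.
sysctl -p /etc/sysctl.d/90-conntrack.conf
```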

Boom, all done! Are we good to go home now?

No. We're not done yet.

Linux, by default, configures the Netfilter connection-tracking timeout values outlandishly high, like a 5-day (!!!) timeout for established connections (nf_conntrack_tcp_timeout_established defaults to 432000 seconds), and we have to do some work on this part.

For starters, let's take the general values used by MikroTik RouterOS (yay, less work!), given that RouterOS is widely used and "representative" as a router (lol).

sussy

With this, we can start translating the values to sysctl

And we end up with this:
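A sketch of what such a sysctl fragment might look like, loosely mirroring RouterOS-style connection-tracking timeouts (the file path and every value below are assumptions, not our exact production config; tune to taste):

```shell
# /etc/sysctl.d/91-conntrack-timeouts.conf (hypothetical path)
# All timeouts are in seconds.
net.netfilter.nf_conntrack_tcp_timeout_established = 86400  # 1 day instead of 5
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 5
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 5
net.netfilter.nf_conntrack_udp_timeout = 10
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 10
net.netfilter.nf_conntrack_generic_timeout = 600
```

Apply with sysctl --system (or a reboot).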

Fuah, finally, now we can sleep until something else happens...

Wait, you said something about not saturating your Gigabit NIC, right?

Pain.

Well, for starters, we're using CAKE, we have net0 as the interface heading to the main switch, and we have a full 1 Gbit/s provisioned.

To set CAKE up, the command will be:

tc qdisc replace dev net0 root cake bandwidth 1024mbit besteffort

Within a server, using 100% of the provisioned bandwidth may work fine in practice. Unlike a local network connected to a consumer ISP, you shouldn't need to sacrifice anywhere close to the typically recommended 5-10% of provisioned bandwidth for traffic shaping.

We also set besteffort for the common scenario where the server doesn't have appropriate Quality of Service markings set up via DiffServ.

Fair scheduling is already great at providing low latency by cycling through hosts and streams, without needing this configuration. The defaults for DiffServ traffic classes like real-time video yield substantial bandwidth in exchange for lower latency. It's easy to set this up wrong, and it usually won't make much sense on a server. You might want to mark low-priority traffic like system updates, but it will already get a tiny share of the overall traffic on a loaded server due to fair scheduling between hosts and streams.

You can use this command to monitor CAKE :

watch -n 1 tc -s qdisc show dev net0

So that's it for today's class, see y'all next time!