The unwelcome school of Netfilter connection tracking, and high traffic networking
Today, while conducting our regular maintenance, I noticed something:
The server is running out of conntrack table space.
"Oh! not again, it's always conntrack!"
Ah yes, the pain of the Netfilter connection tracker. It is back again to haunt the newly busy Reimu server.
To start the pain, Docker powers our infrastructure. With that comes NAT (Network Address Translation), because we don't want to use host networking for anything except the router (we're running Traefik as a router and load balancer).
We have "a lot" of microservices, and ports are precious to us, so nf_conntrack is back to haunt us.
We have Traefik handling the frontend, and we want IP forwarding to correctly forward incoming and outgoing packets. Reimu is quite a busy server, as it's our core server and handles a lot of things, including our download service, which produces a lot of I/O and, of course, a lot of addresses to translate.
Every month (and mid-month), we release a new build for gourami. (Well, only gourami builds are actually served through the download service at the moment, because we can't make it "Operators Friendly" yet. We're open to help if you want to work with us on the frontend (or backend). Just hit me up at [email protected] or on our Telegram channel.)
This generally saturates the Gigabit NIC in Reimu (quite a downgrade from our 10G NIC at Rika, but Rika is dead anyway, and now we have DigitalOcean Spaces S3 storage).
And, oh boy, it generates a lot of packets per second and connections to track, and with that come complaints from some members that their downloads suddenly stopped for no good reason.
With this noted, I opened a regular maintenance slot to see what causes this and when it happens.
After ingesting a bunch of logs, I noticed something:
With the root cause of the dropped packets found, on to the diagnosis step, starting with (no, not playing Among Us) checking our nf_conntrack_max value, which is 262144. No shit, Sherlock.
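A minimal sketch of that check, assuming the nf_conntrack module is loaded so that its counters are exposed under /proc. The helper name usage_pct is made up for this example.

```shell
#!/bin/sh
# usage_pct COUNT MAX -> integer percentage of the table in use
usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Counters exposed by the nf_conntrack module when it is loaded.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max

if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    count=$(cat "$count_file")
    max=$(cat "$max_file")
    echo "conntrack: $count / $max ($(usage_pct "$count" "$max")% used)"
else
    echo "nf_conntrack does not appear to be loaded on this host"
fi
```

If the percentage sits close to 100 during a release, the table is the likely bottleneck.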
Now we come to calculating how big the table needs to be based on our RAM size. Reimu has 64 GB of RAM, and the formula for the optimal table size, according to the Huawei ECS Wiki, is:
RAM Size (in bytes)/16384/2
Plugging in our numbers, the formula becomes
65786680000/16384/2, which produces
2007650.146484375, and finally we round that up to
2097152 (we cheated a bit, given that the Huawei ECS Wiki actually gives the recommended value for a 64 GB RAM ECS, lmao).
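The arithmetic above can be sketched in shell. The function names are made up for this example, and rounding up to the next power of two is an assumption that happens to reproduce 2097152; the wiki value was simply taken as given.

```shell
#!/bin/sh
# conntrack_max_for_ram BYTES -> BYTES / 16384 / 2 (integer division)
conntrack_max_for_ram() {
    echo $(( $1 / 16384 / 2 ))
}

# next_pow2 N -> smallest power of two that is >= N
next_pow2() {
    p=1
    while [ "$p" -lt "$1" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
}

raw=$(conntrack_max_for_ram 65786680000)
echo "raw:     $raw"
echo "rounded: $(next_pow2 "$raw")"
```

Integer division drops the fractional part, so the raw value prints as 2007650 before rounding.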
Then, for the hash size: since we scaled the table size up immensely, we need to scale the hash size up as well, using this formula:
conntrack_max/4, and we got
With this value in hand, we can run:
echo "options nf_conntrack expect_hashsize=131072 hashsize=131072" >/etc/modprobe.d/firewalld-sysctls.conf
systemctl restart firewalld to restart firewalld.
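A small sketch of the hash size arithmetic, with the divisor left as a parameter. The text gives conntrack_max/4, while the 131072 written to modprobe equals 2097152/16, so both are printed for comparison. The helper name hashsize_for is made up; the /sys path is where the loaded module exposes its current value.

```shell
#!/bin/sh
# hashsize_for MAX DIVISOR -> MAX / DIVISOR
hashsize_for() {
    echo $(( $1 / $2 ))
}

echo "max/4:  $(hashsize_for 2097152 4)"
echo "max/16: $(hashsize_for 2097152 16)"

# Value currently used by the loaded module, if any.
param=/sys/module/nf_conntrack/parameters/hashsize
if [ -r "$param" ]; then
    echo "live hashsize: $(cat "$param")"
fi
```

Reading the live value after the restart is a quick way to confirm that the module option was actually picked up.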
Boom, all done! Are we good to go home now?
No. We're not done yet.
Linux, by default, configures the Netfilter connection tracker timeouts outlandishly high, like a 5-day (!!!) timeout for established connections, so we have some work to do on this part.
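To see where the 5 days comes from, here is a minimal sketch that reads the established-connection timeout (432000 seconds by default) and converts it to days. The helper name to_days is made up for this example, and the read only works when nf_conntrack is loaded.

```shell
#!/bin/sh
# to_days SECONDS -> whole days
to_days() {
    echo $(( $1 / 86400 ))
}

f=/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
if [ -r "$f" ]; then
    t=$(cat "$f")
    echo "established timeout: ${t}s (~$(to_days "$t") days)"
fi
```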
For starters, let's take the general values used by MikroTik RouterOS (yay, less work!), given that RouterOS is widely used and "representative" as a router (lol).
With this, we can start translating the values to sysctl
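A minimal sketch of such a translation, assuming RouterOS-style duration strings with a single unit suffix (s, m, h, d). The helpers to_seconds and emit are made up for this example, and the four durations are illustrative assumptions, not necessarily the values that ended up on Reimu. The sysctl key names themselves are the standard net.netfilter ones.

```shell
#!/bin/sh
# to_seconds DURATION -> seconds; accepts one unit suffix: s, m, h or d
to_seconds() {
    n=${1%?}                       # numeric part, suffix stripped
    case $1 in
        *s) echo "$n" ;;
        *m) echo $(( n * 60 )) ;;
        *h) echo $(( n * 3600 )) ;;
        *d) echo $(( n * 86400 )) ;;
        *)  echo "$1" ;;           # bare number: already seconds
    esac
}

# emit KEY DURATION -> one sysctl.conf-style line
emit() {
    echo "net.netfilter.$1 = $(to_seconds "$2")"
}

# Durations below are assumed for illustration only.
emit nf_conntrack_tcp_timeout_established 1d
emit nf_conntrack_tcp_timeout_time_wait   10s
emit nf_conntrack_udp_timeout_stream      3m
emit nf_conntrack_generic_timeout         10m
```

The output can be redirected into a file under /etc/sysctl.d/ and loaded with sysctl --system, once the values have been checked against what your router actually uses.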
And we end up with this:
F'k, finally, now we can sleep until something else happens...
Wait, you said something about not saturating your Gigabit NIC, right?
Well, for starters, we're using CAKE. We have net0 as the interface heading to the main switch, and we have a full 1G provisioned.
To set CAKE up, the command will be:
tc qdisc replace dev net0 root cake bandwidth 1024mbit besteffort
Within a server, using 100% of the provisioned bandwidth may work fine in practice. Unlike a local network connected to a consumer ISP, you shouldn't need to sacrifice anywhere close to the typically recommended 5-10% of provisioned bandwidth for traffic shaping.
We also set besteffort for the common scenario where the server doesn't have appropriate Quality of Service markings set up via DiffServ.
Fair scheduling is already great at providing low latency by cycling through the hosts and streams without needing this configuration. The defaults for DiffServ traffic classes like real-time video yield substantial bandwidth in exchange for lower latency. It's easy to set this up wrong, and it usually won't make much sense on a server. You might want to set up marking low priority traffic like system updates, but it will already get a tiny share of the overall traffic on a loaded server due to fair scheduling between hosts and streams.
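A small dry-run sketch for experimenting with the bandwidth figure. It only prints the tc command rather than running it, and cake_cmd is a made-up helper name. One thing worth keeping in mind: tc reads mbit as 10^6 bits per second, so 1024mbit sits slightly above the nominal 1000mbit line rate of a gigabit link. Whether a value at or just under line rate behaves better on your hardware is an assumption to test, not a recommendation.

```shell
#!/bin/sh
# cake_cmd IFACE RATE_MBIT -> prints the tc command without running it
cake_cmd() {
    echo "tc qdisc replace dev $1 root cake bandwidth ${2}mbit besteffort"
}

cake_cmd net0 1024   # the command used in this post
cake_cmd net0 1000   # nominal gigabit line rate, for comparison
```

Pipe the chosen line to sh as root to apply it.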
You can use this command to monitor CAKE:
watch -n 1 tc -s qdisc show dev net0
So that's it for today's class, see y'all next time!