Linux Netfilter tweaks for High traffic servers

If you run a high-traffic web or DNS server and have recently seen ping loss to the server and failing HTTP requests, start by checking your system log. If you see something similar to the messages below, the following guide will help you tune your Linux server to handle the traffic load properly.

Mar 22 21:25:55 localhost kernel: nf_conntrack: table full, dropping packet.
Mar 22 21:26:00 localhost kernel: printk: 11 messages suppressed.
Mar 22 21:26:00 localhost kernel: nf_conntrack: table full, dropping packet.
Mar 22 21:26:05 localhost kernel: printk: 16 messages suppressed.


You may also see intermittent DNS resolution failures when you run a "yum search" command or use "links" to fetch web content.


To see the current values of the netfilter settings discussed below, use this command:


sysctl -a | grep netfilter

Increasing the connection tracking limit

To do its job, a NAT server has to "remember" every connection that passes through it. Whether it is a "ping" or someone’s "ICQ" session, the NAT server records it in a special in-memory table, the connection tracking table. When a session closes, its entry is deleted from the table. The size of this table is fixed. So if a lot of traffic passes through the server and the table is too small, the NAT server starts dropping packets and breaking sessions. To avoid this, increase the size of the connection tracking table in line with the traffic passing through NAT:

The default value is 65536; you can double or triple it when necessary.

/sbin/sysctl -w net.netfilter.nf_conntrack_max=196608

To make the change persist across reboots, add the setting to /etc/sysctl.conf:

echo "net.netfilter.nf_conntrack_max = 196608" >> /etc/sysctl.conf
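
Appending with echo adds a duplicate line every time the command is run again. As a minimal sketch, the helper below replaces an existing line for a key instead of appending a second one. The function name set_sysctl_conf is hypothetical (it is not part of sysctl), and the file path is passed as an argument so the helper can be tried on a copy first.

```shell
#!/bin/sh
# set_sysctl_conf KEY VALUE [FILE]
# Hypothetical helper: writes "KEY = VALUE" to FILE (default /etc/sysctl.conf),
# replacing any existing assignment of KEY rather than adding a duplicate.
set_sysctl_conf() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    # Escape the dots in the key so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    tmp=$(mktemp) || return 1
    # Keep every line except existing assignments of this key.
    if [ -f "$file" ]; then
        grep -v "^[[:space:]]*$pattern[[:space:]]*=" "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (as root): set_sysctl_conf net.netfilter.nf_conntrack_max 196608
```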

Setting such a large value is not recommended if your NAT server has less than 1 gigabyte of RAM. To show the current value, use:

/sbin/sysctl net.netfilter.nf_conntrack_max

To see how full the connection tracking table currently is, use:

/sbin/sysctl net.netfilter.nf_conntrack_count
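
The two values are easier to judge as a percentage. The sketch below assumes the nf_conntrack module is loaded and that both keys shown above exist. The function names are made up for this example.

```shell
#!/bin/sh
# conntrack_usage_pct COUNT MAX
# Prints how full the table is as an integer percentage.
conntrack_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Reads the live values with "sysctl -n" (value only, no key name).
# Only works on a host where nf_conntrack is loaded.
conntrack_usage_now() {
    count=$(/sbin/sysctl -n net.netfilter.nf_conntrack_count) || return 1
    max=$(/sbin/sysctl -n net.netfilter.nf_conntrack_max) || return 1
    echo "$count / $max ($(conntrack_usage_pct "$count" "$max")%)"
}

conntrack_usage_pct 49152 65536   # prints 75
```

Run conntrack_usage_now from cron or a monitoring check if you want a warning before the table fills up.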


Increasing the size of hash-table

The hash table, which stores the lists of conntrack entries, should be increased proportionally.
The rule of thumb is: hashsize = nf_conntrack_max / 8
The default value is 16384, and this value cannot be set in the /etc/sysctl.conf file.
To change it on the fly:

echo 24576 > /sys/module/nf_conntrack/parameters/hashsize

To make it permanent, add the following to /etc/modprobe.conf (the module name matches the nf_conntrack path used above):

options nf_conntrack hashsize=24576
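
The value 24576 follows from the rule above applied to nf_conntrack_max = 196608. A small sketch of that arithmetic, with a helper name chosen only for this example:

```shell
#!/bin/sh
# hashsize_for CONNTRACK_MAX
# Applies the rule of thumb hashsize = nf_conntrack_max / 8.
hashsize_for() {
    echo $(( $1 / 8 ))
}

hashsize_for 196608   # prints 24576
```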



Decreasing time-out values

A NAT server only tracks "live" sessions that pass through it. When a session is closed, its entry is removed so that the connection tracking table does not overflow. Entries are also removed on timeout: if a session stays idle for a long time, it is considered closed and its entry is removed from the connection tracking table.

However, the default timeouts are quite large. With heavy traffic, even if you stretch nf_conntrack_max to the limit, you still risk overflowing the table quickly and having connections broken. To prevent this, set the connection tracking timeouts on the NAT server appropriately. The current values can be seen with, for example:

sysctl -a | grep conntrack | grep timeout

As a result, you’ll see something like this:

net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15

These timeouts are in seconds. As you can see, net.netfilter.nf_conntrack_generic_timeout is 600 (10 minutes). That is, the NAT server keeps a session in memory as long as something passes over it at least once every 10 minutes.

At first glance that seems fine, but in practice it is very bad. Look at net.netfilter.nf_conntrack_tcp_timeout_established: its value is 432000. In other words, your NAT server will keep tracking an idle TCP session as long as at least one packet passes over it every 5 days (!).
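
Raw second counts are hard to read at a glance. As a sketch, the filter below appends a days/hours/minutes/seconds form to each "key = value" line. Both function names are invented for this example, and the filter assumes every input line has exactly that form.

```shell
#!/bin/sh
# human_timeout SECONDS
# Prints a duration as days, hours, minutes and seconds.
human_timeout() {
    s=$1
    echo "$(( s / 86400 ))d $(( s % 86400 / 3600 ))h $(( s % 3600 / 60 ))m $(( s % 60 ))s"
}

# Reads "key = value" lines on stdin and appends the readable duration.
annotate_timeouts() {
    while read -r key _eq value; do
        echo "$key = $value ($(human_timeout "$value"))"
    done
}

human_timeout 432000   # prints 5d 0h 0m 0s

# On a live system:
#   sysctl -a | grep conntrack | grep timeout | annotate_timeouts
```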

To lower it:

echo "net.netfilter.nf_conntrack_tcp_timeout_established = 86400" >> /etc/sysctl.conf

This does not require a reboot; run "sysctl -p" to load the file. Connections that are already tracked keep their current timer until it is next refreshed, and only then does the new value take effect for them.

Put more simply, such a NAT server is easy to DDoS: a simple flood overflows its connection tracking table (nf_conntrack_max), so it starts breaking connections and, in the worst case, quickly turns into a black hole.
Setting the timeouts to somewhere within 30-120 seconds is recommended. This is sufficient for normal users and clears the NAT table in time, which prevents it from overflowing. Do not forget to add the corresponding changes to /etc/rc.local and /etc/sysctl.conf.


Results

After tuning, you will have a viable and productive NAT server. Of course, this is only basic tuning; we have not touched on kernel tuning and the like. However, in most cases even these simple steps are enough for normal operation of a fairly large network. For reference, our network has more than 30 thousand subscribers, and its traffic was handled by 4 NAT servers.

______________________________________________________________________________________________

Below is another page of reference on the values.
______________________________________________________________________________________________

There are two parameters we can play with:
- the maximum number of allowed conntrack entries, which will be called
  CONNTRACK_MAX in this document
- the size of the hash table storing the lists of conntrack entries, which
  will be called HASHSIZE (see below for a description of the structure)


CONNTRACK_MAX is the maximum number of "sessions" (connection tracking entries)
that can be handled simultaneously by netfilter in kernel memory.


A conntrack entry is stored in a node of a linked list, and there are several
lists, each list being an element in a hash table.  So each hash table entry
(also called a bucket) contains a linked list of conntrack entries.
To access a conntrack entry corresponding to a packet, the kernel has to:
- compute a hash value according to some defined characteristics of the packet.
  This is a constant time operation.
  This hash value will then be used as an index in the hash table, where a
  list of conntrack entries is stored.
- iterate over the linked list of conntrack entries to find the right one.
  This is a more costly operation, depending on the size of the list (and on
  the position of the wanted conntrack entry in the list).


The hash table contains HASHSIZE linked lists.  When the limit is reached
(the total number of conntrack entries being stored has reached CONNTRACK_MAX),
each list will contain ideally (in the optimal case) about
CONNTRACK_MAX/HASHSIZE entries.
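
As a rough illustration of that ratio, the sketch below computes the average list length for a full table, rounded up. The function name is made up for this example, and an evenly distributing hash (the optimal case described above) is assumed.

```shell
#!/bin/sh
# avg_chain_length CONNTRACK_MAX HASHSIZE
# Average number of entries per bucket when the table is full,
# rounded up, assuming the hash spreads entries evenly.
avg_chain_length() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

avg_chain_length 65536 8192       # prints 8
avg_chain_length 1048576 1048576  # prints 1
```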


The hash table occupies a fixed amount of non-swappable kernel memory,
whether you have any connections or not.  But the maximum number of conntrack
entries determines how many conntrack entries can be stored (globally into the
linked lists), i.e. how much kernel memory they will be able to occupy at most.


This document will now give you hints about how to choose optimal values for
HASHSIZE and CONNTRACK_MAX, in order to get the best out of the netfilter
conntracking/NAT system.


Default values of CONNTRACK_MAX and HASHSIZE
============================================


By default, both CONNTRACK_MAX and HASHSIZE get average values for
"reasonable" use, computed automatically according to the amount of
available RAM.


Default value of CONNTRACK_MAX
------------------------------


On i386 architecture, CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 =
RAMSIZE (in MegaBytes) * 64.
So for example, a 32-bit PC with 512MB of RAM can handle 512*1024^2/16384 =
512*64 = 32768 simultaneous netfilter connections by default.


But the real formula is:
CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32)
where x is the number of bits in a pointer (for example, 32 or 64 bits)


Please note that:
- default CONNTRACK_MAX value will never be lower than 128
- for systems with more than 1GB of RAM, default CONNTRACK_MAX value is
  limited to 65536 (but can of course be set to more manually).
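
The formula and the two limits above can be combined into a few lines of shell. This is only a sketch of the rules as stated in this document, not code taken from the kernel, and the function name is hypothetical.

```shell
#!/bin/sh
# default_conntrack_max RAM_BYTES [POINTER_BITS]
# RAM / 16384 / (x / 32), never below 128, capped at 65536.
# A plain cap at 65536 is enough: with at most 1GB of RAM the
# formula cannot exceed 65536 anyway.
default_conntrack_max() {
    ram_bytes=$1
    ptr_bits=${2:-32}
    max=$(( ram_bytes / 16384 / (ptr_bits / 32) ))
    if [ "$max" -lt 128 ]; then max=128; fi
    if [ "$max" -gt 65536 ]; then max=65536; fi
    echo "$max"
}

default_conntrack_max $(( 512 * 1024 * 1024 )) 32   # prints 32768
```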


Default value of HASHSIZE
-------------------------


By default, CONNTRACK_MAX = HASHSIZE * 8.  This means that there is an average
of 8 conntrack entries per linked list (in the optimal case, and when
CONNTRACK_MAX is reached), each linked list being a hash table entry
(a bucket).


On i386 architecture, HASHSIZE = CONNTRACK_MAX / 8 =
RAMSIZE (in bytes) / 131072 = RAMSIZE (in MegaBytes) * 8.
So for example, a 32-bit PC with 512MB of RAM can store 512*1024^2/128/1024 =
512*8 = 4096 buckets (linked lists).


But the real formula is:
HASHSIZE = CONNTRACK_MAX / 8 = RAMSIZE (in bytes) / 131072 / (x / 32)
where x is the number of bits in a pointer (for example, 32 or 64 bits)


Please note that:
- default HASHSIZE value will never be lower than 16
- for systems with more than 1GB of RAM, default HASHSIZE value is limited
  to 8192 (but can of course be set to more manually).
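
The same kind of sketch works for the default HASHSIZE, again following only the rules stated here and using a made-up function name.

```shell
#!/bin/sh
# default_hashsize RAM_BYTES [POINTER_BITS]
# RAM / 131072 / (x / 32), never below 16, capped at 8192.
default_hashsize() {
    ram_bytes=$1
    ptr_bits=${2:-32}
    size=$(( ram_bytes / 131072 / (ptr_bits / 32) ))
    if [ "$size" -lt 16 ]; then size=16; fi
    if [ "$size" -gt 8192 ]; then size=8192; fi
    echo "$size"
}

default_hashsize $(( 512 * 1024 * 1024 )) 32   # prints 4096
```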


Reading CONNTRACK_MAX and HASHSIZE
==================================


Current CONNTRACK_MAX value can be read at runtime, via the /proc filesystem.


Before Linux kernel version 2.4.23, use:
# cat /proc/sys/net/ipv4/ip_conntrack_max


Since Linux kernel version 2.4.23 (thus Linux 2.6 as well), use:
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
  (old /proc/sys/net/ipv4/ip_conntrack_max is then deprecated!)


Current HASHSIZE is always available (for every kernel version) in syslog
messages, as the number of buckets (which is HASHSIZE) is printed there at
ip_conntrack initialization.
Since Linux kernel version 2.4.24 (thus Linux 2.6 as well), current HASHSIZE
value can be read at runtime with:
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_buckets
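
Because the path depends on the kernel version, a script can simply try the candidates in order. The helper below is a sketch with a hypothetical name; the paths in the usage comment are the ones listed above.

```shell
#!/bin/sh
# first_readable FILE...
# Prints the contents of the first file that exists and is readable,
# and returns non-zero if none of them is.
first_readable() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            cat "$f"
            return 0
        fi
    done
    return 1
}

# Usage on a live system, newest path first:
#   first_readable /proc/sys/net/ipv4/netfilter/ip_conntrack_max \
#                  /proc/sys/net/ipv4/ip_conntrack_max
```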


Modifying CONNTRACK_MAX and HASHSIZE
====================================


Default CONNTRACK_MAX and HASHSIZE values are reasonable for a typical host,
but you may increase them on heavily loaded firewalling-only systems.
So CONNTRACK_MAX and HASHSIZE values can be changed manually if needed.


While accessing a bucket is a constant time operation (hence the interest
of having a hash of lists), keep in mind that the kernel has to iterate over
a linked list to find a conntrack entry.  So the average size of a linked
list (CONNTRACK_MAX/HASHSIZE in the optimal case when the limit is reached)
must not be too big.  This ratio is set to 8 by default (when values are
computed automatically).
On systems with enough memory and where performance really matters, you can
consider trying to get an average of one conntrack entry per hash bucket,
which means HASHSIZE = CONNTRACK_MAX.


Setting CONNTRACK_MAX
---------------------


Conntrack entries are stored in linked lists, so the maximum number of
conntrack entries (CONNTRACK_MAX) can be easily configured dynamically.


Before Linux kernel version 2.4.23, use:
# echo $CONNTRACK_MAX > /proc/sys/net/ipv4/ip_conntrack_max


Since Linux kernel version 2.4.23 (thus Linux 2.6 as well), use:
# echo $CONNTRACK_MAX > /proc/sys/net/ipv4/netfilter/ip_conntrack_max


where $CONNTRACK_MAX is an integer.


Setting HASHSIZE
----------------


For mathematical reasons, hash tables have static sizes.  So HASHSIZE must be
determined before the hash table is created and begins to be filled.


Before Linux kernel version 2.4.21, a prime number should be chosen for hash
size, ensuring that the hash table will be efficiently populated. Odd
non-prime numbers or even numbers are strongly discouraged, as the hash
distribution will be sub-optimal.


Since Linux kernel version 2.4.21 (thus Linux 2.6 as well), conntrack
uses the jenkins2b hash algorithm, which is happy with all sizes, but a power
of 2 works best.


If netfilter conntrack is statically compiled in the kernel, the hash table
size can be set at compile time, or (since kernel 2.6) as a boot option with
ip_conntrack.hashsize=$HASHSIZE


If netfilter conntrack is compiled as a module, the hash table size can be
set at module insertion, with the following command:
# modprobe ip_conntrack hashsize=$HASHSIZE


where $HASHSIZE is an integer.


Since 2.6.14, it is possible to set hashsize dynamically at runtime,
after boot and module load.


Between 2.6.14 and 2.6.19 (included), use:
# echo $HASHSIZE > /sys/module/ip_conntrack/parameters/hashsize


Since 2.6.20, use:
# echo $HASHSIZE > /sys/module/nf_conntrack/parameters/hashsize


Ideal case: firewalling-only machine
------------------------------------


In the ideal case, you have a machine _just_ doing packet filtering and NAT
(i.e. almost no userspace running, at least none that would have a growing
memory consumption like proxies, ...).


The size of kernel memory used by netfilter connection tracking is:
size_of_mem_used_by_conntrack (in bytes) =
        CONNTRACK_MAX * sizeof(struct ip_conntrack) +
        HASHSIZE * sizeof(struct list_head)
where:
- sizeof(struct ip_conntrack) can vary quite much, depending on architecture,
  kernel version and compile-time configuration. To know its size, see the
  kernel log message at ip_conntrack initialization time.
  sizeof(struct ip_conntrack) is around 300 bytes on i386 for 2.6.5, but
  heavy development around 2.6.10 made it vary between 352 and 192 bytes!
- sizeof(struct list_head) = 2 * size_of_a_pointer
  On i386, size_of_a_pointer is 4 bytes.


So, on i386, kernel 2.6.5, size_of_mem_used_by_conntrack is around
CONNTRACK_MAX * 300 + HASHSIZE * 8 (bytes).


If we take HASHSIZE = CONNTRACK_MAX (if we have most of the memory dedicated
to firewalling, see "Modifying CONNTRACK_MAX and HASHSIZE" section above),
size_of_mem_used_by_conntrack would be around CONNTRACK_MAX * 308 bytes
on i386 systems, kernel 2.6.5.
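
The estimate can be checked with a small calculation. The defaults below (300 bytes per entry, 4-byte pointers) are the i386, kernel 2.6.5 figures quoted above. On other kernels and architectures they are assumptions, so pass your own values instead. The function name is invented for this example.

```shell
#!/bin/sh
# conntrack_mem_bytes CONNTRACK_MAX HASHSIZE [ENTRY_BYTES] [POINTER_BYTES]
# CONNTRACK_MAX * sizeof(entry) + HASHSIZE * 2 * sizeof(pointer)
conntrack_mem_bytes() {
    max=$1
    hashsize=$2
    entry_bytes=${3:-300}
    ptr_bytes=${4:-4}
    echo $(( max * entry_bytes + hashsize * 2 * ptr_bytes ))
}

conntrack_mem_bytes 1048576 1048576   # prints 322961408 (1048576 * 308)
conntrack_mem_bytes 32768 4096        # prints 9863168
```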


Now suppose your firewalling-only box has 512MB of RAM (a decent amount
of memory considering today's memory prices). You have to spare a bit of
memory for a few applications (syslog, etc.): 128MB should really be big
enough for a firewall in console mode, for example.
The rest can be dedicated to conntrack entries.
Then you could set both CONNTRACK_MAX and HASHSIZE approximately to:
(512 - 128) * 1024^2 / 308 =~ 1307315 (instead of 32768 for CONNTRACK_MAX,
and 4096 for HASHSIZE by default).
Since Linux 2.4.21 (thus Linux 2.6 as well), hash algorithm is happy with
"power of 2" sizes (it used to be a prime number before).


So here we can set CONNTRACK_MAX and HASHSIZE to 1048576 (2^20), for example.
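
The step from the memory budget to a power of 2 can be sketched as follows. The function name is made up for this example, and 308 bytes per entry is the i386, kernel 2.6.5 estimate used above.

```shell
#!/bin/sh
# floor_pow2 N
# Prints the largest power of 2 that is not greater than N (N >= 1).
floor_pow2() {
    n=$1
    p=1
    while [ $(( p * 2 )) -le "$n" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
}

# 512MB of RAM minus 128MB kept for applications, 308 bytes per entry.
budget=$(( (512 - 128) * 1024 * 1024 / 308 ))
echo "$budget"         # prints 1307315
floor_pow2 "$budget"   # prints 1048576
```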


This way, you can store about 32 times more conntrack entries than the
default, and get better performance for conntrack entry access.

6 comments:

Anonymous said...

Hello, Anthony!

Can you let know what kind of cpu/memory usage and conntrack active connection numbers you are seeing with your setup, as well as the number of users you are serving?

Does the latency also increase with increasing active connections?

Do you allow your users to use p2p or is it just standard web-browsing behavior?

Thank you!

Anthony Chin said...

Hi,

I was using it for my public cloud server, which utilized all the 8GB memory that I have. But for the CPU load, it wasn't really there. As I'm using Apache to do cluster with 8 backend Apache for the processing.

It was pure http traffic, no P2P at all.

WinningIndustries said...

Great thanks, this helped us very well!

imperia said...

I would use >> (append) instead of > (overwrite/create) redirect command for this example:

echo net.ipv4.netfilter.ip_conntrack_max = 196608 > /etc/sysctl.conf

Anthony Chin said...

Hi Imperia,

Thanks for pointing out..
Typos..

yodog said...

awesome info thanks :)