RabbitMQ net-split messages explanation
After a network problem, RabbitMQ writes a record to its logs that looks like this:
=INFO REPORT==== 30-Jan-2017::19:04:04 ===
node '[email protected]' down: connection_closed
In this case the reason is connection_closed. But it is not always
evident what a given reason actually means or what could have
caused it, especially in some outright bizarre situations.
Here I’m trying to document all the reasons that I’ve seen and how
you can reproduce them.
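When a log contains many of these records, tallying them by reason is a quick way to see which of the sections below are relevant. A minimal sketch, assuming the log is plain text in the format shown above; the function name down_reasons and the log path are hypothetical:

```shell
# Tally "node '...' down: <reason>" records by reason.
# Usage: down_reasons /path/to/rabbit.log   (the path is an assumption)
down_reasons() {
    awk -F"down: " '
        /^node .* down: / { count[$2]++ }
        END { for (r in count) print r, count[r] }
    ' "$1" | sort
}
```

Each output line is a reason followed by the number of times it occurred.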
connection_closed
This happens whenever a connection is closed through “normal” mechanisms. Some ways to reproduce it:
- Stop a remote RabbitMQ node
- Send RST from a remote node, e.g. using iptables
- Attach to a running Erlang VM with gdb and do call close(some-fd) there
net_tick_timeout
This happens whenever a remote node stops responding; to the sender it looks like its traffic is being blackholed. Some possible causes:
- Loss of network connectivity between 2 machines
- Death of a remote machine
- Firewall rule that drops packets
- Somebody is sending a very big chunk of data through the RabbitMQ
  cluster channel, e.g. AMQP messages so big that they saturate the
  network for at least net_tick_timeout.
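To get a feel for how much data the last case requires, here is a back-of-the-envelope sketch. It assumes the link is entirely consumed by the transfer and takes the tick time in seconds as a parameter; 60 is the default of the Erlang net_ticktime kernel setting, and the function name is hypothetical:

```shell
# Bytes needed to keep a link of <mbit_per_s> busy for <ticktime_s> seconds.
# ticktime_s defaults to 60, the Erlang net_ticktime default.
saturating_bytes() {
    mbit_per_s=$1
    ticktime_s=${2:-60}
    echo $(( mbit_per_s * 1000000 / 8 * ticktime_s ))
}

saturating_bytes 1000   # 1 Gbit/s link, default tick time: prints 7500000000
```

So on an otherwise idle 1 Gbit/s link it takes roughly 7.5 GB in flight; on a slower or already busy link far less is enough.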
disconnect
An explicit disconnect performed using erlang:disconnect_node/1,
either by some internal RabbitMQ mechanism or by somebody messing
with rabbitmqctl eval.
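One way to reproduce this by hand on a test cluster is sketched below. The helper name disconnect_expr and the node name rabbit@other-host are placeholders, not values taken from a real setup:

```shell
# Build the Erlang expression that disconnects the given node.
disconnect_expr() {
    printf "erlang:disconnect_node('%s')." "$1"
}

# On a test cluster (node name is a placeholder):
# rabbitmqctl eval "$(disconnect_expr rabbit@other-host)"
```

The rabbitmqctl call is left commented out on purpose, since running it against a live cluster really does drop the connection between the nodes.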
etimedout
Another quite interesting reason. I believe this can happen only
when the OS TCP stack is tuned in such a way that the TCP timeout
is shorter than net_tick_timeout. On Linux this can be reproduced
with some extreme tuning:
cd /proc/sys/net/ipv4
echo 2 > tcp_keepalive_intvl
echo 1 > tcp_keepalive_probes
echo 2 > tcp_keepalive_time
echo 1 > tcp_retries1
echo 2 > tcp_retries2
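Since these settings are system-wide, it is worth recording the current values before experimenting. A small sketch, assuming a Linux /proc layout; the function name snapshot_tcp_settings is hypothetical, and the directory is a parameter only so the function can be tried against a copy:

```shell
# Print name=value for every TCP keepalive and retry setting,
# so the originals can be written back after the experiment.
snapshot_tcp_settings() {
    dir=${1:-/proc/sys/net/ipv4}
    for f in "$dir"/tcp_keepalive_* "$dir"/tcp_retries*; do
        if [ -r "$f" ]; then
            printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
}
```

Running snapshot_tcp_settings > tcp-before.txt before the tuning keeps a record of what to echo back afterwards.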
econnreset
This is the strangest of all the reasons; I’ve seen it only in
production logs and can’t reproduce it myself. One very probable
explanation is that an RST packet arrived with exceptionally bad
timing: just after the socket was returned from epoll as ready,
but before the read/write operation on it actually started.