After experiencing a network problem, RabbitMQ writes a record to its logs that looks like this:
=INFO REPORT==== 30-Jan-2017::19:04:04 === node [email protected]' down: connection_closed
In this case the reason is connection_closed. But sometimes it may not be evident what this actually means or what could have caused the error, especially in some outright bizarre situations.
Here I'm trying to document all the reasons that I've seen and how
you can reproduce them.
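To see which of these reasons actually show up on a given node, the records can be counted per reason. This is a minimal sketch, assuming the reason is a single lowercase word after "down:" as in the record above; the function name count_down_reasons is hypothetical and the log file path is supplied by the caller, since it differs between installations.

```shell
#!/bin/sh
# Count "node ... down: <reason>" records per reason in a log file.
# $1 = path to the log file (no default location is assumed here).
count_down_reasons() {
    grep -oE "node .* down: [a-z_]+" "$1" |  # keep only the matching part of each record
        sed 's/.* down: //' |                # strip everything up to the reason
        sort | uniq -c | sort -rn            # most frequent reason first
}
```

Calling count_down_reasons with the path to a node's log prints one line per reason, prefixed with its count. Reasons that are not a plain lowercase word would be cut short by this pattern.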
The connection_closed reason appears whenever a connection is closed using "normal" mechanisms. Some ways to reproduce it:
Any time a remote node stops responding; to the sender it looks like blackholing. Some reasons:
Explicit disconnect performed using
erlang:disconnect_node/1. Either by some internal RabbitMQ
mechanism or by somebody messing with
Another quite interesting reason. I believe that this can happen only when the OS TCP stack is tuned in such a way that the TCP timeout is shorter than net_tick_timeout. On Linux this can be reproduced with some extreme tuning:
cd /proc/sys/net/ipv4
echo 2 > tcp_keepalive_intvl
echo 1 > tcp_keepalive_probes
echo 2 > tcp_keepalive_time
echo 1 > tcp_retries1
echo 2 > tcp_retries2
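To put the keepalive part of this tuning in perspective, here is a rough back-of-the-envelope sketch. It assumes keepalive is enabled on the socket and the connection is otherwise idle, in which case the kernel gives up after roughly the idle time plus one interval per unanswered probe. The helper name keepalive_detect_secs is hypothetical, and the retransmission limits are not modelled here.

```shell
#!/bin/sh
# Rough time (seconds) for TCP keepalive to give up on a silent peer:
# idle time before the first probe plus one interval per unanswered probe.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_detect_secs() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_secs 2 2 1        # the tuned values above: prints 4
keepalive_detect_secs 7200 75 9    # usual Linux defaults: prints 7875
```

With the tuned values the kernel can declare the connection dead within a few seconds, well below Erlang's default net_ticktime of 60 seconds, whereas with the usual defaults the Erlang tick mechanism would notice first.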
This is the strangest of all the reasons; I've seen it only in production logs and can't reproduce it myself. One very probable explanation is that an RST packet arrived with exceptionally bad timing: just after the socket was returned from epoll as ready, but before the read/write operation on it actually started.