
Settings for a healthy Event Store cluster

When we configure a cluster of nodes we must verify a number of settings that control how the nodes communicate with each other. Nowadays our cluster will probably be hosted in some cloud network, and therefore we must tune these settings to keep the cluster healthy and happy.

In this article we focus only on the settings each node uses to communicate with the other nodes in the same cluster and with external client applications.

The nodes communicate with each other for the following reasons:

  • To check if other nodes are up and running and to let them know that this node is healthy (GossipIntervalMs, GossipTimeoutMs)
  • To replicate the data from Master to Slaves and Clones (IntTcpHeartbeatInterval, IntTcpHeartbeatTimeout)
  • To forward requests to the Master node (IntTcpHeartbeatInterval, IntTcpHeartbeatTimeout)
  • To communicate with client applications (ExtTcpHeartbeatInterval, ExtTcpHeartbeatTimeout)

Based on my experience, I use the following rule of thumb to configure these settings: the interval is set to half of the timeout. For example, with a timeout of 4000 ms the interval becomes 2000 ms. I found this basic rule working well in almost all situations, and you can adopt it to decide how to set your own values.

The Gossip messages

Nodes sending Gossip messages at the configurable GossipInterval

When a node that is part of a cluster starts, it checks the state of the other members of the cluster by sending and replying to Gossip messages. These messages are sent at a configurable interval, with a configurable timeout to wait for a reply before considering the other node dead.

When a Gossip message is not received within the configured interval plus the timeout, the other node is considered dead and an election process is started. If the network is unstable or slow, or if resource pressure (disk, CPU, memory) makes a node too slow to reply, the other nodes can consider it dead.

This situation can often happen when the cluster is hosted in a cloud, especially when the nodes are spread across different availability zones for data replication scenarios.

The same situation can sometimes happen in an on-premises local network as well, but it tends to be less frequent.

When a node fails to reply within the configured period and an election process is started, you can go through a double election as soon as that node starts sending Gossip messages again. When you see frequent re-elections in the logs, you probably need to tune the GossipIntervalMs and GossipTimeoutMs settings.

GossipIntervalMs defines the frequency of the checks. GossipTimeoutMs defines how long we wait for each check before considering it failed.

Good settings for a normal cloud environment are the following:

GossipIntervalMs: 2000
GossipTimeoutMs: 4000
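
As a sketch, assuming the usual Event Store convention of exposing every option as a YAML entry in the configuration file, as an environment variable, or as a command-line argument (check the documentation linked at the end for the exact spelling in your version), the gossip values above can be applied in any of these forms; the same pattern applies to the IntTcp* and ExtTcp* settings discussed later:

# In the YAML configuration file (the file passed to the server, e.g. with --config)
GossipIntervalMs: 2000
GossipTimeoutMs: 4000

# As environment variables
EVENTSTORE_GOSSIP_INTERVAL_MS=2000
EVENTSTORE_GOSSIP_TIMEOUT_MS=4000

# As command-line arguments
--gossip-interval-ms=2000 --gossip-timeout-ms=4000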

Internal TCP communications

When the nodes form a cluster, each of them has a role that is assigned at the end of the election process. A node can be Master, Slave or Clone. Let’s imagine that our node is the Master. It is responsible for persisting writes locally, replicating the writes to the Slaves, and waiting for as many acks as the configured quorum of nodes requires.

Replication within a cluster, governed by IntTcpHeartbeatInterval and IntTcpHeartbeatTimeout

This is one of the reasons why a cluster with too many nodes can hurt performance: the Master must wait for an increasing number of acks coming back from the Slave nodes.
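
As a minimal sketch of that arithmetic (this is just the usual majority-quorum formula, not code taken from Event Store), here is how the number of acks grows with the cluster size:

// Majority quorum for a cluster of n nodes: n / 2 + 1 (integer division).
static int Quorum(int clusterSize) => clusterSize / 2 + 1;

// Quorum(3) == 2 -> the Master waits for 1 ack from the Slaves
// Quorum(5) == 3 -> the Master waits for 2 acks
// Quorum(7) == 4 -> the Master waits for 3 acks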

Another use of the internal TCP channel between nodes is request forwarding. When a client tries to write to a Slave or a Clone, its request is automatically forwarded to the current Master node over the internal TCP channel. The client is also informed about the forward so that it can interact directly with the Master for the next write request.

Request forwarding example
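
If you prefer to skip the forwarding hop altogether, the C# client can be told to talk to the Master directly. A minimal sketch, assuming the ConnectionSettings builder of EventStore.ClientAPI with the PerformOnMasterOnly/PerformOnAnyNode options is available in your client version:

// Minimal sketch (EventStore.ClientAPI, C#): node preference on the client.
// PerformOnMasterOnly sends writes straight to the Master;
// PerformOnAnyNode accepts any node and relies on the forwarding described above.
var settings = ConnectionSettings.Create()
    .PerformOnMasterOnly();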

Good settings for the internal TCP channel in a cloud environment:

IntTcpHeartbeatInterval: 1500
IntTcpHeartbeatTimeout: 3000

External TCP communications

One important rule to always keep in mind is that these settings are on the server side. Most likely you have client applications connecting to the server, right? You must configure similar settings on the client side so that both sides agree on the acceptable period before considering the connection dead.

On the server side, the settings to consider are ExtTcpHeartbeatInterval and ExtTcpHeartbeatTimeout. On the client side the method names can vary slightly, but with the official C# API you can use SetHeartbeatInterval and SetHeartbeatTimeout.

Server-side example of good ExtTcp settings for a cloud environment:

ExtTcpHeartbeatInterval: 3000
ExtTcpHeartbeatTimeout: 6000

Client-side example of connection configuration using the C# client (it is similar in the other clients for different languages):

var settings = ConnectionSettings.Create()
    .SetHeartbeatInterval(TimeSpan.FromSeconds(3))
    .SetHeartbeatTimeout(TimeSpan.FromSeconds(6));
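
To put those settings in context, a minimal (hypothetical) connection could look like the following; the address, port and credentials are placeholders for your own external TCP endpoint:

// "settings" is the builder shown above.
var connection = EventStoreConnection.Create(settings, new Uri("tcp://admin:changeit@node1:1113"));
await connection.ConnectAsync();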

Conclusion

The above settings can be a good starting point for your configuration. I usually set them explicitly instead of relying on the defaults. You can also change my rule of thumb of keeping the interval at half of the timeout; for example, you can use the same value for both. Test these settings in a staging environment and watch the frequency of elections and the socket connection timeouts between nodes and clients to tune and refine them.

The settings that I found working in almost all environments, cloud or local networks, are:

IntTcpHeartbeatInterval: 1500
IntTcpHeartbeatTimeout: 3000
ExtTcpHeartbeatInterval: 3000
ExtTcpHeartbeatTimeout: 6000
GossipIntervalMs: 2000
GossipTimeoutMs: 4000

Client side:
var settings = ConnectionSettings.Create()
    .SetHeartbeatInterval(TimeSpan.FromSeconds(3))
    .SetHeartbeatTimeout(TimeSpan.FromSeconds(6));

Event Store has a very rich list of configuration settings that you can combine with the few discussed in this post. Here are the links to the official documentation:
https://eventstore.org/docs/dotnet-api/connecting-to-a-server/index.html
https://eventstore.org/docs/server/command-line-arguments/index.html
https://eventstore.org/docs/server/ports-and-networking/index.html