Crunchy Data products often include High Availability. Patroni and etcd are two of our go-to tools for managing those environments. Today I wanted to explore how these work together. Patroni relies on proper operation of the etcd cluster to decide what to do with PostgreSQL. When communication between these two pieces breaks down, it creates instability in the environment, resulting in failover, cluster restart, and even the loss of a primary database. To fully understand the importance of this relationship, we need to understand a few core concepts of how these pieces work. First, we'll start with a brief overview of the components involved in HA systems and their role in the environment.
Overview of HA Infrastructure
HA systems can be set up in a single or multi-datacenter configuration. Crunchy supports HA on cloud, traditional, or containerized infrastructure. When used in a single datacenter, the environment is typically set up as a 3-node cluster on three separate database hosts. When used across multiple datacenters, the environment typically has an active datacenter, where the primary HA cluster and applications are running, and one or more standby datacenters, each containing a replica HA cluster that is always available. Although the setup may be different, the basic components and primary function of the environment remain the same.
Main HA components:
For this article, we will focus on three basic components which are essential in both single datacenter and multi-datacenter environments:
- PostgreSQL cluster: the database cluster, usually consisting of a primary and two or more replicas
- Patroni: used as the failover management utility
- etcd: used as a distributed configuration store (DCS), containing cluster information such as configuration, health, and current status.
How HA components work together:
Each PostgreSQL instance within the cluster has one application database. These instances are kept in sync through streaming replication. Each database host has its own Patroni instance which monitors the health of its PostgreSQL database and stores this information in etcd. The Patroni instances use this data to:
- keep track of which database instance is primary
- maintain quorum among available replicas and keep track of which replica is the most "current"
- determine what to do in order to keep the cluster healthy as a whole
Patroni manages the instances by periodically sending out a heartbeat request to etcd which communicates the health and status of the PostgreSQL instance. etcd records this information and sends a response back to Patroni. The process is similar to a heart monitoring device. Consistent, periodic pulses indicate a healthy database.
etcd Consensus Protocol
The etcd consensus protocol requires etcd cluster members to write every request down to disk, making it very sensitive to disk write latency. If Patroni receives an answer from etcd indicating the primary is healthy before the heartbeat times out, the replicas will continue to follow the current primary.
If the etcd system cannot verify writes before the heartbeats time out, or if the primary instance fails to renew its status as leader, Patroni will assume the cluster member is unhealthy and put the database into a fail-safe configuration. This triggers an election to promote a new leader, and the old primary is demoted and becomes a replica.
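To make the renew-or-lose-it behavior concrete, here is a minimal sketch in Python of a leader key with a time-to-live. It is a toy model for illustration only, not Patroni's or etcd's actual code: the LeaderLock class, the 30-second ttl, and the member names pg1 and pg2 are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderLock:
    """Toy model of a leader key with a time-to-live (TTL), in seconds."""
    ttl: float
    owner: Optional[str] = None
    expires_at: float = 0.0

    def holder(self, now: float) -> Optional[str]:
        # The lock only counts while its lease has not expired.
        return self.owner if now < self.expires_at else None

    def heartbeat(self, member: str, now: float) -> bool:
        """Acquire or renew the lock; return True if `member` is leader afterwards."""
        current = self.holder(now)
        if current is None or current == member:
            self.owner = member
            self.expires_at = now + self.ttl
            return True
        return False


lock = LeaderLock(ttl=30.0)
print(lock.heartbeat("pg1", now=0.0))    # pg1 acquires the lock
print(lock.heartbeat("pg2", now=10.0))   # pg2 is refused, pg1 still holds it
print(lock.heartbeat("pg1", now=20.0))   # pg1 renews in time
# pg1's next renewal is delayed (for example by slow etcd writes), so the lease lapses.
print(lock.heartbeat("pg2", now=55.0))   # pg2 takes over
print(lock.heartbeat("pg1", now=56.0))   # pg1 has lost leadership and must demote
```

The point of the model is that nothing has to be wrong with PostgreSQL itself for the primary to change. A renewal that arrives late is enough.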
Common Causes of Communication Failures
Communication failure between Patroni and etcd is one of the most common reasons for failover in HA environments. Some of the most common reasons for communication issues are:
- an under-resourced file system
- I/O contention in the environment
- network transit timeouts
Under-resourced file system
Because HA solutions must be sufficiently resourced at all points at all times to work well, the proper resources must be available to the etcd server in order to mitigate failovers. As mentioned before, the etcd consensus protocol requires etcd cluster members to write every request down to disk, and every time a key is updated for a cluster, a new revision is created. When the system runs low on space (usage above 75%), etcd goes read/delete only until revisions and keys are removed or disk space is added. For optimal performance, we recommend keeping disk usage below 75%.
I/O contention
The etcd consensus protocol requires etcd cluster members to write every request down to disk, making it very sensitive to disk write latency. Systems under heavy loads, particularly during peak or maintenance hours, are susceptible to I/O bottlenecks as processes are forced to compete for resources. This contention can increase I/O wait time and prevent Patroni from receiving an answer from etcd before the heartbeat times out. This is especially true when running virtual machines, as neighboring machines can impact I/O. Other sources of contention might be heavy I/O from PostgreSQL and excessive paging due to high connection rates and/or memory starvation.
Network delay
Problems in the etcd system, which is critical for the stability of the HA solution, can register as network transit timeouts. This could be due to either actual network timeouts or massive resource starvation at the etcd level. If you notice that timeout errors typically coincide with periods of heavy network traffic, then network delay could be the root cause of these timeout errors.
Diagnosing the system
Confirm the issue
When troubleshooting your system, the best place to start is by checking the logs. If communication issues between Patroni and etcd are at the heart of the issue, you will most likely see errors in your log files like the examples below.
First, check PostgreSQL logs in order to rule out any issues with the PostgreSQL host itself. By default, these logs are stored under pg_log in the PostgreSQL data directory. If your logs are not in the default location, you can determine the exact location by running the command show log_directory; in the database. Check for any indication of other PostgreSQL processes crashing or being killed prior to the error message. For example:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
If no other PostgreSQL processes crashed, check the Patroni logs for any errors or events that may have occurred shortly before this error was logged by PostgreSQL. Messages like "demoted self because failed to update leader lock in DCS" and "Loop time exceeded" indicate communication and timeout issues with etcd.
Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,510 INFO: Selected new etcd server http://10.84.32.146:2379
Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,683 INFO: demoted self because failed to update leader lock in DCS
Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,684 WARNING: Loop time exceeded, rescheduling immediately.
Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,686 INFO: closed patroni connection to the postgresql cluster
Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,705 INFO: Lock owner: None; I am pg1
Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,706 INFO: not healthy enough for leader race
Feb 1 13:45:06 patroni: 2021-02-01 13:45:06,657 INFO: starting after demotion in progress
Feb 1 13:45:09 patroni: 2021-02-01 13:45:09,521 INFO: postmaster pid=1521
Feb 1 13:45:11 patroni: /var/run/postgresql:5434 - rejecting connections
If you see a message like the one above, the next step is to check etcd logs. Look for messages logged right before the message logged in the Patroni logs. If communication issues are to blame for your environment's behavior, you will likely see errors like the one below.
Feb 1 13:44:21 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.177252ms)
Feb 1 13:44:21 etcd: server is likely overloaded
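If you have a lot of log lines to wade through, a small script can pull out just the symptoms described in this section. The following is a rough sketch in Python: the message fragments come from the log excerpts in this article, while the label names and the classify helper are made up for this example. Your own log format and wording may differ between versions.

```python
import re

# Message fragments taken from the log excerpts in this article.
PATTERNS = {
    "patroni_demoted": re.compile(r"demoted self because failed to update leader lock in DCS"),
    "patroni_loop_time": re.compile(r"Loop time exceeded"),
    "etcd_heartbeat": re.compile(r"failed to send out heartbeat on time"),
    "etcd_overloaded": re.compile(r"server is likely overloaded"),
}


def classify(lines):
    """Return (label, line) pairs for lines that match a known symptom."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
    return hits


sample = [
    "Feb 1 13:44:21 etcd: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.177252ms)",
    "Feb 1 13:44:21 etcd: server is likely overloaded",
    "Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,683 INFO: demoted self because failed to update leader lock in DCS",
    "Feb 1 13:45:05 patroni: 2021-02-01 13:45:05,705 INFO: Lock owner: None; I am pg1",
]

for label, line in classify(sample):
    print(f"{label}: {line}")
```

To run it against real logs, pass an open file object to classify instead of the sample list. Lining up the etcd hits with the Patroni hits by timestamp shows whether the etcd warnings came first.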
Narrow down the cause
Check disk space
To see if a lack of disk space is the root of the problem, check the disk space available to the etcd system by running the Linux command df on the etcd directory, typically /var/lib/etcd. The disk space available to this directory should be checked on all servers, including the Patroni server. For optimal performance, we recommend keeping disk usage below 75%. If the amount of space used is approaching or exceeding 75%, then allocating more space to this directory may resolve the issue. (More information in the Recommendation section further below.)
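If you would rather script this check than eyeball df output, here is a small sketch in Python using the standard library's shutil.disk_usage. The 75% threshold is the guideline from this article. The helper names are made up for this example, and the percentage is computed as used/total, which can differ slightly from the figure df reports on file systems with reserved blocks.

```python
import shutil

THRESHOLD_PERCENT = 75.0  # guideline from this article


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    if total == 0:
        return 0.0
    return 100.0 * used / total


def check_path(path: str, threshold: float = THRESHOLD_PERCENT) -> bool:
    """Print usage for the file system holding `path`; return True if under threshold."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    status = "ok" if pct < threshold else "over threshold"
    print(f"{path}: {pct:.1f}% used ({status})")
    return pct < threshold


if __name__ == "__main__":
    # "/" is used so the sketch runs anywhere; point it at your etcd data
    # directory (typically /var/lib/etcd, per the article) on each server.
    check_path("/")
```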
Analyze performance metrics
If the file system still has plenty of space available, we will need to dig deeper to find the source of the problem by analyzing the overall performance of the system. A good place to start is with the sar command, which is part of the Linux sysstat package and can be run on the system by any user. This command provides additional information about the system, such as system load, memory and CPU usage, which can be used to pinpoint any bottlenecks or pain points in your system. By default, the command displays CPU activity and collects these statistics every 10 minutes.
The nice thing about sar is that it stores historical data by default with a one-month retention. On RHEL/CentOS/Fedora distributions, this data is stored under /var/log/sa/; for Debian/Ubuntu systems, it's stored under /var/log/sysstat/. Each log file name ends in dd, where dd represents the day of the month. For example, the log file for the first of the month would be sa01, and the file for the 15th follows the same pattern.
This means that if the sysstat package was installed and running on the server when the etcd timeout occurred, we can go back and analyze the performance data around the time of the incident. Note: Because the sar command only reports on local activities, each of the servers in the etcd quorum will need to be checked. If the sysstat package was not installed or was not running during the time of the incident, it will need to be installed and enabled so that this information will be available the next time the etcd timeout issue occurs. For our purposes, we will assume the package was running.
Going back to the etcd log example we looked at earlier, we can see that the timeout issue occurred at 13:44:21 on the first of the month. By specifying the relevant file name along with a start and end time in our sar command, we can extract the information relevant to the time of the incident. Note: Use a start time slightly before the timestamp of the error in order to see the state of the system before the timeout was triggered. For example:
sar -f /var/log/sa/sa01 -s 13:30:00 -e 13:51:00
-f: file name and path
-s: start time, in HH:MM:SS format
-e: end time, in HH:MM:SS format
This should give us output that looks something like:
[user@localhost ~]$ sar -f /var/log/sa/sa01 -s 13:30:00 -e 13:51:00
01:30:01 PM     CPU     %usr    %nice     %sys  %iowait   %steal    %idle
01:40:01 PM     all     2.71     0.00     2.02     0.92     0.00    94.32
01:50:01 PM     all     2.10     0.00     1.79     7.86     0.00    88.22
Average:        all     2.41     0.00     1.91     4.39     0.00    91.27
Here we can see a jump in %iowait between 1:40pm and 1:50pm, indicating a sudden burst of activity around the time of the etcd error, as suspected.
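Spotting a jump like this by eye works for a three-line sample, but not for a full day of data. Here is a rough sketch in Python that reads sar CPU output in the 12-hour format shown in this article and flags samples where %iowait rose sharply compared with the previous sample. The function names and the factor of 3 are arbitrary choices for this example, not a recommended threshold.

```python
SAR_CPU_OUTPUT = """\
01:30:01 PM CPU %usr %nice %sys %iowait %steal %idle
01:40:01 PM all 2.71 0.00 2.02 0.92 0.00 94.32
01:50:01 PM all 2.10 0.00 1.79 7.86 0.00 88.22
Average: all 2.41 0.00 1.91 4.39 0.00 91.27
"""


def iowait_samples(text):
    """Return (timestamp, %iowait) pairs from `sar` CPU output in 12-hour format."""
    samples = []
    iowait_col = None
    for line in text.splitlines():
        fields = line.split()
        if "%iowait" in fields:
            iowait_col = fields.index("%iowait")  # header row locates the column
        elif iowait_col is not None and len(fields) > iowait_col and fields[1] in ("AM", "PM"):
            samples.append((f"{fields[0]} {fields[1]}", float(fields[iowait_col])))
    return samples


def jumps(samples, factor=3.0):
    """Return samples whose %iowait is at least `factor` times the previous sample."""
    return [
        (cur_time, prev, cur)
        for (_, prev), (cur_time, cur) in zip(samples, samples[1:])
        if prev > 0 and cur >= factor * prev
    ]


samples = iowait_samples(SAR_CPU_OUTPUT)
for when, before, after in jumps(samples):
    print(f"{when}: %iowait rose from {before} to {after}")
```

The Average row is skipped on purpose, since it does not carry a timestamp and would otherwise be compared against the last real sample.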
Adding the -d flag to the command will let us take a closer look at each block device and allow us to compare how long the I/O requests took from start to finish (the await column) with how long the requests actually took to complete (the svctm column).
[user@localhost ~]$ sar -f /var/log/sa/sa01 -s 13:30:00 -e 13:51:00 -d
01:30:01 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
01:40:01 PM    dev8-0      4.36      2.19     49.63     11.88      0.03      7.59      5.67      2.47
01:40:01 PM    dev8-1      0.66      1.92    317.55    480.49      0.05     74.76      6.63      0.44
01:40:01 PM    dev8-2      0.09      0.85      1.05     21.13      0.00      4.31      4.31      0.04
01:40:01 PM    dev8-3      7.14      0.43    175.16     24.58      0.06      7.97      3.81      2.72
01:50:01 PM    dev8-0      4.40      1.91     45.88     10.86      1.00    226.16     28.65     12.61
01:50:01 PM    dev8-1      0.48      0.00     10.07     20.83      0.12    245.13     73.72      3.56
01:50:01 PM    dev8-2      0.24      0.00     57.48    241.24      0.27   1123.66    235.00      5.60
01:50:01 PM    dev8-3      6.91      0.01    165.57     23.98      0.78    112.61     19.24     13.28
Since requests can lose time waiting in a queue if the device is already busy and won't accept additional concurrent requests, it is not unusual for the service time to be slightly smaller than the waiting time. However, in this example we can see that I/O requests on dev8-2 took an average of 1123.66ms from start to finish even though they took an average of only 235ms to actually complete. Both figures are a significant increase from where they were previously, when both await and svctm took only 4.31ms. Considering these times are averages, it isn't hard to imagine that any spikes that may have occurred were likely much higher than the time shown in this output. If you find similar jumps in your environment, then an under-resourced system is likely the cause of the timeout errors.
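The gap between await and svctm is roughly the time requests spent sitting in the queue, so it can be handy to compute it per device. Below is a small Python sketch that does this for a few rows copied from the sar -d output in this article. The queue_times helper is hypothetical, and it assumes the older column layout shown here, where both await and svctm are present.

```python
SAR_DISK_OUTPUT = """\
01:30:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
01:40:01 PM dev8-2 0.09 0.85 1.05 21.13 0.00 4.31 4.31 0.04
01:50:01 PM dev8-0 4.40 1.91 45.88 10.86 1.00 226.16 28.65 12.61
01:50:01 PM dev8-2 0.24 0.00 57.48 241.24 0.27 1123.66 235.00 5.60
"""


def queue_times(text):
    """Return (time, device, await, svctm, await - svctm) rows, in milliseconds."""
    rows = []
    cols = None
    for line in text.splitlines():
        fields = line.split()
        if "await" in fields and "svctm" in fields:
            # Header row: remember where the columns of interest live.
            cols = (fields.index("DEV"), fields.index("await"), fields.index("svctm"))
        elif cols and len(fields) > max(cols) and fields[1] in ("AM", "PM"):
            dev, aw, sv = (fields[i] for i in cols)
            rows.append((f"{fields[0]} {fields[1]}", dev, float(aw), float(sv),
                         round(float(aw) - float(sv), 2)))
    return rows


for when, dev, aw, sv, queued in queue_times(SAR_DISK_OUTPUT):
    print(f"{when} {dev}: await={aw}ms svctm={sv}ms queued={queued}ms")
```

A device whose queued time grows from near zero to hundreds of milliseconds between two samples is a good candidate for closer inspection.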
Solutions and Suggested Steps
Now that we have a better idea of what might be causing the issue, here are some things we can do to fix it:
If the etcd directory is running low on space (i.e. the amount of space used is approaching or exceeding 75%), allocate more disk space to this directory and see if the heartbeat timeout issue is resolved. Similarly, if you find spikes of %iowait and I/O contention that correlate with the time of the timeout incident, we recommend increasing the IOPS on all systems running the etcd quorum.
Find the cause of I/O spikes
While increasing resources may help in the short term, identifying the cause of the await jump is key to determining a long-term solution. Work with your systems administrator to diagnose and resolve the underlying cause of I/O contention in your environment.
Relocate your etcd
If etcd is sharing a storage device with another resource, consider relocating the etcd data to its own dedicated device to ensure that etcd has a dedicated I/O queue for any I/O that it needs. In a multi-node environment, this means one node should be dedicated entirely to etcd. For optimum performance, choose a device with low-latency networking and low-latency storage I/O.
Please note: If the underlying issue is the disk itself, rather than just the disk performance, moving the etcd data to a new storage device may not fully correct the issue if other parts of the cluster are still reliant on the disk.
Resolve network delay
If communication errors persist after you have increased resources to the etcd system, then network delay is the most likely remaining cause. For a long-term solution, you will need to work with your network administrator to diagnose and resolve the underlying cause of network delay in your environment.
Increase Timeout Intervals
While increasing resources and resolving the underlying issue of I/O contention and/or network delay are the only way to fully resolve the issue, a short-term solution to the problem would be to increase the timeout interval. This will give etcd more time to verify and write requests to disk before the heartbeat times out and Patroni triggers an election.
If you are using Crunchy Data's High-Availability solution, this can be accomplished by changing the heartbeat_interval parameter in your group_vars/etcd.yml file and rerunning your playbook. Below is an example of how it should look:
etcd_user_member_parameters:
  heartbeat_interval: <value>
If you are using another solution in your environment, you should be able to increase this setting by changing the parameter in your etcd configuration file, typically located under
IMPORTANT: Setting the heartbeat interval to a value that's too high will result in long election timeouts and the etcd cluster will take longer to detect leader failure. This should be treated as a last-ditch effort and only used as a way of mitigating the issue until the underlying cause can be diagnosed and resolved.
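To see why a larger heartbeat interval drags the election timeout up with it, here is a back-of-the-envelope sketch in Python. It assumes etcd's defaults of a 100ms heartbeat interval and a 1000ms election timeout, and it assumes the election timeout must be at least 5 times the heartbeat interval. That ratio reflects our understanding of etcd's startup validation, so treat it as an assumption and confirm it against the etcd tuning documentation for your version. The function names are made up for this example.

```python
DEFAULT_ELECTION_MS = 1000  # assumed etcd default election timeout


def min_election_timeout(heartbeat_ms: int, ratio: int = 5) -> int:
    """Smallest election timeout (ms) that keeps `ratio` heartbeats per timeout."""
    return heartbeat_ms * ratio


def check_timing(heartbeat_ms: int, election_ms: int, ratio: int = 5) -> bool:
    """Return True if the election timeout leaves room for `ratio` heartbeats."""
    return election_ms >= min_election_timeout(heartbeat_ms, ratio)


for heartbeat_ms in (100, 250, 500):
    needed = min_election_timeout(heartbeat_ms)
    ok = check_timing(heartbeat_ms, election_ms=DEFAULT_ELECTION_MS)
    print(f"heartbeat={heartbeat_ms}ms: election timeout >= {needed}ms needed; "
          f"{DEFAULT_ELECTION_MS}ms {'is' if ok else 'is not'} enough")
```

Since a failed leader is only noticed once the election timeout has passed, every increase here is time the cluster spends without realizing its leader is gone.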
Hopefully you now have a better understanding of why Patroni's timely and consistent communication with etcd is essential to maintaining a healthy HA environment, as well as what you can do to diagnose and fix communication issues between the two.
Crunchy Data strongly recommends ensuring a good, reliable network to your DCS to prevent failover from occurring. We also strongly recommend monitoring your environment for disk space issues, archiving issues, failover occurrences, and replication slot failures.
November 3, 2021