Failure of upgrades to release versions 2.18 and 2.20
Product | Affected Versions | Related Issues | Fixed In |
---|---|---|---|
YugabyteDB, YugabyteDB Anywhere | v2.18, v2.20 | #21491 | v2.20.2.1, v2.18.7.0 |
Description
Upgrades from prior versions (other than v2.14 and v2.16) to v2.18 or v2.20 can fail due to a race condition during post-upgrade actions. Although the yb-tservers themselves can be healthy and their Raft configurations can remain intact, they fail to heartbeat to the yb-master.
The race is unlikely, but it occurs when YugabyteDB Anywhere executes the post-upgrade action of un-blacklisting yb-tservers at the exact same time as yb-master executes a background task that generates the universe_uuid field. The issue is less likely in v2.20 because the post-upgrade actions take much longer in v2.20 than in v2.18, which significantly reduces the probability of hitting it.
The v2.14 and v2.16 releases are not impacted by this issue. This is because the flag master_enable_universe_uuid_heartbeat_check is not auto-promoted, so the functionality is OFF by default until you explicitly turn it ON.
Mitigation
Set the master_enable_universe_uuid_heartbeat_check flag on yb-master to false. This can be performed as a non-rolling, non-restart YugabyteDB Anywhere upgrade after the database upgrade is complete.
After this flag change is applied, upgrade to a release with the fix and then re-enable the flag.
Re-enabling the flag requires running a yb-ts-cli command to clear the universe_uuid on all nodes. After the universe_uuid is cleared, the flag can be re-enabled on yb-master.
Details
The universe_uuid field was added to ClusterConfig as part of #17904. This is essentially an identity for the universe, which all the yb-tservers inherit from the yb-master as part of the heartbeat. If set, this value is not meant to change on either the yb-tservers or yb-masters and provides a way for the yb-master to reject any heartbeats from a different universe.
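As a rough illustration of the identity check described above, the following Python sketch models a master that hands its universe_uuid to a yb-tserver on the first heartbeat and rejects heartbeats that carry a different value. The class and method names (Master, TServer, handle_heartbeat) are hypothetical and exist only for this example; they are not YugabyteDB APIs, and the real check is implemented inside yb-master.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Master:
    # Models the universe_uuid field stored in ClusterConfig (hypothetical class).
    universe_uuid: str

    def handle_heartbeat(self, reported_uuid: Optional[str]) -> Optional[str]:
        """Return the universe_uuid the tserver should hold, or None if rejected."""
        if reported_uuid is None:
            # A tserver without an identity inherits it from the master.
            return self.universe_uuid
        if reported_uuid != self.universe_uuid:
            # The heartbeat comes from a different universe: reject it.
            return None
        return reported_uuid


@dataclass
class TServer:
    universe_uuid: Optional[str] = None

    def heartbeat(self, master: Master) -> bool:
        """Send one heartbeat; return True if the master accepted it."""
        reply = master.handle_heartbeat(self.universe_uuid)
        if reply is None:
            return False
        self.universe_uuid = reply
        return True


master_a = Master(str(uuid.uuid4()))
master_b = Master(str(uuid.uuid4()))
ts = TServer()

first_ok = ts.heartbeat(master_a)    # tserver inherits master_a's universe_uuid
same_ok = ts.heartbeat(master_a)     # accepted: identities match
foreign_ok = ts.heartbeat(master_b)  # rejected: different universe
```

In this model the yb-tserver never changes its identity once it has one, which is why a master that later regenerates its own universe_uuid ends up rejecting every existing yb-tserver.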
For universes upgrading from an older release to one having the preceding commit, the catalog manager generates a new universe_uuid and propagates it to the yb-tservers. However, the version number is not incremented before the universe_uuid is persisted in cluster_config.
As a result of this, the following race is possible:
1. Cluster gets upgraded to a release with commit fb98e56 and the feature master_enable_universe_uuid_heartbeat_check is enabled due to promotion of flags.
2. YugabyteDB Anywhere reads the cluster configuration (ClusterConfig) at version 'X'.
3. Catalog manager background thread runs and generates a new universe_uuid, persists it in ClusterConfig, and propagates it to all the yb-tservers.
4. YugabyteDB Anywhere from Step 2 updates the ClusterConfig using ChangeMasterClusterConfigRequestPB with version 'X' (for un-blacklisting nodes).
5. Update from Step 4 succeeds because ClusterConfig version 'X' on disk matches the one in the request 'X', effectively overwriting the universe_uuid generated in Step 3.
6. Catalog manager background thread runs again and, because the universe_uuid is empty, it generates a new one.
After the new universe_uuid is generated on the catalog manager in Step 6, yb-master starts rejecting heartbeats from all the yb-tservers, which keep reporting the previous universe_uuid generated by the catalog manager in Step 3.
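The sequence in Steps 2 to 6 can be replayed with a small, deterministic Python model of a version-checked configuration write. This is only a sketch under stated assumptions: ClusterConfig, CatalogManager, change_cluster_config, and run_race are hypothetical names for this example, the starting version 7 and the node name ts-1 are arbitrary, and the bump_version_on_uuid switch reflects an assumption about how incrementing the version would close the race, not the actual patch.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ClusterConfig:
    version: int
    universe_uuid: Optional[str] = None
    blacklist: tuple = ()


class CatalogManager:
    def __init__(self, bump_version_on_uuid: bool):
        # Arbitrary starting state: version 7 with one blacklisted node.
        self.config = ClusterConfig(version=7, blacklist=("ts-1",))
        self.bump_version_on_uuid = bump_version_on_uuid

    def background_task(self) -> None:
        """Generate a universe_uuid if none is set (Steps 3 and 6)."""
        if self.config.universe_uuid is None:
            bump = 1 if self.bump_version_on_uuid else 0
            self.config = replace(
                self.config,
                version=self.config.version + bump,
                universe_uuid=str(uuid.uuid4()),
            )

    def change_cluster_config(self, new_config: ClusterConfig) -> bool:
        """Accept the write only if the caller's version matches (Steps 4 and 5)."""
        if new_config.version != self.config.version:
            return False
        self.config = replace(new_config, version=new_config.version + 1)
        return True


def run_race(bump_version_on_uuid: bool):
    cm = CatalogManager(bump_version_on_uuid)
    snapshot = cm.config                         # Step 2: client reads version X
    cm.background_task()                         # Step 3: universe_uuid generated
    tserver_uuid = cm.config.universe_uuid       # value propagated to yb-tservers
    stale = replace(snapshot, blacklist=())      # Step 4: un-blacklist from stale copy
    accepted = cm.change_cluster_config(stale)   # Step 5: version check
    cm.background_task()                         # Step 6: regenerates only if empty
    heartbeats_ok = cm.config.universe_uuid == tserver_uuid
    return accepted, heartbeats_ok


buggy = run_race(bump_version_on_uuid=False)
guarded = run_race(bump_version_on_uuid=True)
```

Without the version increment, the stale write is accepted and the master ends up with a different universe_uuid than the one the yb-tservers hold, so buggy is (True, False). With the increment, the stale write is rejected and the identity is preserved, so guarded is (False, True); in that case the client would need to re-read the configuration and retry the un-blacklisting.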