Manual remote bootstrap of failed peer
When a Raft peer fails, YugabyteDB executes an automatic remote bootstrap to create a new peer from the remaining ones.
If a majority of Raft peers fail for a given tablet, you need to manually execute a remote bootstrap. A list of tablets is available in the yb-master Admin UI at `yb-master-ip:7000/tablet-replication`.
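For example, assuming the master web UI is reachable on its default port 7000, you can pull that page from the command line; `yb-master-ip` below is a placeholder for one of your master hosts:

```sh
# Fetch the tablet-replication page, which lists tablets with replication problems.
curl -s http://yb-master-ip:7000/tablet-replication
```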
Assume you have a cluster where the following applies:
- Replication factor is 3.
- A tablet with UUID `TABLET1`.
- Three tablet peers, with one in good working order, referred to as `NODE_GOOD`, and two broken peers referred to as `NODE_BAD1` and `NODE_BAD2`.
- Some of the tablet-related data is to be copied from the good peer to each of the bad peers until the majority of them are restored.
These are the steps to follow:
- Delete the tablet from the broken peers, if necessary, by running the following:

  ```sh
  yb-ts-cli --server_address=NODE_BAD1 delete_tablet TABLET1
  yb-ts-cli --server_address=NODE_BAD2 delete_tablet TABLET1
  ```

- Trigger a remote bootstrap of `TABLET1` from `NODE_GOOD` to `NODE_BAD1`:

  ```sh
  yb-ts-cli --server_address=NODE_BAD1 remote_bootstrap NODE_GOOD TABLET1
  ```

After the remote bootstrap finishes, `NODE_BAD2` should be automatically removed from the quorum and `TABLET1` should be fixed, as it now has a majority of healthy peers.
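To confirm that the bootstrap completed, you can list the tablets hosted on the repaired peer and check that `TABLET1` appears and is in a running state; this is only a quick sketch, and the exact output format depends on your yb-ts-cli version:

```sh
# List tablets on NODE_BAD1 and filter for the repaired tablet.
yb-ts-cli --server_address=NODE_BAD1 list_tablets | grep TABLET1
```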
If you can't perform the preceding steps, you can do the following to manually execute the equivalent of a remote bootstrap:
- On `NODE_GOOD`, create an archive of the WALs (Raft data), RocksDB (regular) directories, intents (transactions data), and snapshots directories for `TABLET1`.
- Copy these archives over to `NODE_BAD1`, onto the same drive where `TABLET1` currently has its Raft and RocksDB data.
- Stop `NODE_BAD1`, as the file system data underneath it will change.
- Remove the old WALs, RocksDB, intents, and snapshots data for `TABLET1` from `NODE_BAD1`.
- Unpack the data copied from `NODE_GOOD` into the corresponding (now empty) directories on `NODE_BAD1`.
- Restart `NODE_BAD1` so it can bootstrap `TABLET1` using this new data.
- Restart `NODE_GOOD` so it can properly observe the changed state and data on `NODE_BAD1`.
At this point, `NODE_BAD2` should be automatically removed from the quorum and `TABLET1` should be fixed, as it now has a majority of healthy peers.
Note that typically, when you try to find tablet data, you would use a `find` command across the `--fs_data_dirs` paths.
In the following example, assume that it is set to `/mnt/d0` and your tablet UUID is `c08596d5820a4683a96893e092088c39`:

```sh
find /mnt/d0/ -name '*c08596d5820a4683a96893e092088c39*'
```

```
/mnt/d0/yb-data/tserver/wals/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39
/mnt/d0/yb-data/tserver/tablet-meta/c08596d5820a4683a96893e092088c39
/mnt/d0/yb-data/tserver/consensus-meta/c08596d5820a4683a96893e092088c39
/mnt/d0/yb-data/tserver/data/rocksdb/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39
/mnt/d0/yb-data/tserver/data/rocksdb/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39.intents
/mnt/d0/yb-data/tserver/data/rocksdb/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39.snapshots
```
The data you would be interested in is the following:
- For the Raft WALs: `/mnt/d0/yb-data/tserver/wals/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39`
- For the RocksDB regular database: `/mnt/d0/yb-data/tserver/data/rocksdb/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39`
- For the intents files: `/mnt/d0/yb-data/tserver/data/rocksdb/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39.intents`
- For the snapshot files: `/mnt/d0/yb-data/tserver/data/rocksdb/table-2fa481734909462385e005ba23664537/tablet-c08596d5820a4683a96893e092088c39.snapshots`
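Putting these paths together, the following is a minimal sketch of the manual copy described in the steps above. It assumes a single data drive mounted at `/mnt/d0` on both nodes, SSH access between them, and that the tablet server process on `NODE_BAD1` has already been stopped; the host names are placeholders, some directories (for example, `.snapshots`) may not exist on your cluster, and the way you stop and start processes depends on your deployment.

```sh
# Run on NODE_GOOD: archive only the tablet's WALs, RocksDB, intents, and snapshots data.
TABLE_DIR=table-2fa481734909462385e005ba23664537
TABLET=tablet-c08596d5820a4683a96893e092088c39

tar czf /tmp/${TABLET}.tgz \
  /mnt/d0/yb-data/tserver/wals/${TABLE_DIR}/${TABLET} \
  /mnt/d0/yb-data/tserver/data/rocksdb/${TABLE_DIR}/${TABLET} \
  /mnt/d0/yb-data/tserver/data/rocksdb/${TABLE_DIR}/${TABLET}.intents \
  /mnt/d0/yb-data/tserver/data/rocksdb/${TABLE_DIR}/${TABLET}.snapshots

# Copy the archive over to the broken peer.
scp /tmp/${TABLET}.tgz NODE_BAD1:/tmp/

# Run on NODE_BAD1 with the tablet server stopped (set TABLE_DIR and TABLET there as well):
# remove the old data for the tablet, then unpack the copy in its place.
rm -rf /mnt/d0/yb-data/tserver/wals/${TABLE_DIR}/${TABLET} \
       /mnt/d0/yb-data/tserver/data/rocksdb/${TABLE_DIR}/${TABLET} \
       /mnt/d0/yb-data/tserver/data/rocksdb/${TABLE_DIR}/${TABLET}.intents \
       /mnt/d0/yb-data/tserver/data/rocksdb/${TABLE_DIR}/${TABLET}.snapshots
tar xzf /tmp/${TABLET}.tgz -C /

# Restart the tablet server on NODE_BAD1, and then the one on NODE_GOOD.
```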