ClickHouse Keeper (clickhouse-keeper)

Not supported in ClickHouse Cloud

Note

This page is not applicable to ClickHouse Cloud. The procedure documented here is automated in ClickHouse Cloud services.

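ClickHouse Keeper provides the coordination system for data replication and distributed DDL query execution. ClickHouse Keeper is compatible with ZooKeeper.

Implementation details

ZooKeeper is one of the first well-known open-source coordination systems. It is implemented in Java and has a fairly simple yet powerful data model. ZooKeeper's coordination algorithm, ZooKeeper Atomic Broadcast (ZAB), does not provide linearizability guarantees for reads, because each ZooKeeper node serves reads locally. Unlike ZooKeeper, ClickHouse Keeper is written in C++ and uses the Raft algorithm. Raft allows linearizability for both reads and writes and has several open-source implementations in different languages.

By default, ClickHouse Keeper provides the same guarantees as ZooKeeper: linearizable writes and non-linearizable reads. It has a compatible client-server protocol, so any standard ZooKeeper client can be used to interact with ClickHouse Keeper. Snapshots and logs have a format that is incompatible with ZooKeeper, but the clickhouse-keeper-converter tool converts ZooKeeper data into ClickHouse Keeper snapshots. The interserver protocol in ClickHouse Keeper is also incompatible with ZooKeeper, so a mixed ZooKeeper / ClickHouse Keeper cluster is impossible.

ClickHouse Keeper supports access control lists (ACLs) the same way ZooKeeper does. It supports the same set of permissions and has the identical built-in schemes: world, auth, and digest. The digest authentication scheme uses the pair username:password, and the password is encoded in Base64.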
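As an illustration of the digest scheme, the sketch below builds a digest ACL id the way ZooKeeper does: the username followed by the Base64 encoding of the SHA-1 hash of username:password. That Keeper uses exactly the same encoding is an assumption based on its stated ZooKeeper compatibility; the text above only says the password is Base64-encoded. The function name and the credentials are hypothetical.

```python
import base64
import hashlib


def digest_acl_id(username: str, password: str) -> str:
    """Build a ZooKeeper-style digest ACL id.

    Assumed format: username:base64(sha1("username:password")).
    """
    raw = f"{username}:{password}".encode("utf-8")
    hashed = base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")
    return f"{username}:{hashed}"


# "alice" / "secret" are placeholder credentials for illustration only.
print(digest_acl_id("alice", "secret"))
```

The resulting string is what a ZooKeeper-compatible client would pass as the id of a digest ACL entry.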

Note

External integrations are not supported.

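Configuration

ClickHouse Keeper can be used as a standalone replacement for ZooKeeper or as an internal part of the ClickHouse server. In both cases the configuration is almost the same .xml file.

Keeper configuration settings

The main ClickHouse Keeper configuration tag is <keeper_server>, which has the following parameters: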

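Parameter	Description	Default
`tcp_port`	Port for a client to connect.	`2181`
`tcp_port_secure`	Secure port for an SSL connection between client and keeper-server.	-
`server_id`	Unique server ID. Each participant of the ClickHouse Keeper cluster must have a unique number (1, 2, 3, and so on).	-
`log_storage_path`	Path to coordination logs. As with ZooKeeper, it is best to store logs on non-busy nodes.	-
`snapshot_storage_path`	Path to coordination snapshots.	-
`enable_reconfiguration`	Enable dynamic cluster reconfiguration via `reconfig`.	`False`
`max_memory_usage_soft_limit`	Soft limit, in bytes, of Keeper's maximum memory usage.	`max_memory_usage_soft_limit_ratio` * `physical_memory_amount`
`max_memory_usage_soft_limit_ratio`	If `max_memory_usage_soft_limit` is not set or is set to zero, this value is used to define the default soft limit.	`0.9`
`cgroups_memory_observer_wait_time`	If `max_memory_usage_soft_limit` is not set or is set to `0`, this interval is used to observe the amount of physical memory. Once the amount changes, Keeper's soft memory limit is recalculated using `max_memory_usage_soft_limit_ratio`.	`15`
`http_control`	Configuration of the HTTP control interface.	-
`digest_enabled`	Enable real-time data consistency checks.	`True`
`create_snapshot_on_exit`	Create a snapshot during shutdown.	-
`hostname_checks_enabled`	Enable sanity hostname checks for the cluster configuration (for example, if localhost is used with remote endpoints).	`True`
`four_letter_word_white_list`	White list of 4lw commands.	`conf, cons, crst, envi, ruok, srst, srvr, stat, wchs, dirs, mntr, isro, rcvr, apiv, csnp, lgif, rqld, ydld`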

Other common parameters are inherited from the ClickHouse server config (listen_host, logger, and so on).
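The relationship between max_memory_usage_soft_limit and max_memory_usage_soft_limit_ratio in the table above can be written out as a small calculation. This is only a sketch of the documented default rule, not Keeper's actual code; the function name is hypothetical and the physical memory amount is passed in by the caller.

```python
def effective_soft_limit(configured_limit: int, ratio: float, physical_memory: int) -> int:
    """Return the soft memory limit in bytes.

    When the configured limit is unset (represented here as 0), fall back to
    ratio * physical_memory, as described for max_memory_usage_soft_limit.
    """
    if configured_limit > 0:
        return configured_limit
    return int(ratio * physical_memory)


# An explicit limit wins; otherwise the ratio applies to physical memory.
print(effective_soft_limit(0, 0.5, 8 * 2**30))      # half of 8 GiB
print(effective_soft_limit(2 * 2**30, 0.9, 8 * 2**30))  # explicit 2 GiB limit
```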

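Internal coordination settings

Internal coordination settings are located in the <keeper_server>.<coordination_settings> section and have the following parameters:

Parameter	Description	Default
`operation_timeout_ms`	Timeout for a single client operation (ms).	`10000`
`min_session_timeout_ms`	Minimum timeout for a client session (ms).	`10000`
`session_timeout_ms`	Maximum timeout for a client session (ms).	`100000`
`dead_session_check_period_ms`	How often ClickHouse Keeper checks for dead sessions and removes them (ms).	`500`
`heart_beat_interval_ms`	How often a ClickHouse Keeper leader sends heartbeats to followers (ms).	`500`
`election_timeout_lower_bound_ms`	If a follower does not receive a heartbeat from the leader in this interval, it can initiate leader election. Must be less than or equal to `election_timeout_upper_bound_ms`. Ideally they should not be equal.	`1000`
`election_timeout_upper_bound_ms`	If a follower does not receive a heartbeat from the leader in this interval, it must initiate leader election.	`2000`
`rotate_log_storage_interval`	How many log records to store in a single file.	`100000`
`reserved_log_items`	How many coordination log records to store before compaction.	`100000`
`snapshot_distance`	How often ClickHouse Keeper creates new snapshots (in number of records in logs).	`100000`
`snapshots_to_keep`	How many snapshots to keep.	`3`
`stale_log_gap`	Threshold at which the leader considers a follower stale and sends it a snapshot instead of logs.	`10000`
`fresh_log_gap`	When a node becomes fresh.	`200`
`max_requests_batch_size`	Maximum size of a batch, in number of requests, before it is sent to Raft.	`100`
`force_sync`	Call `fsync` on each write to the coordination log.	`true`
`quorum_reads`	Execute read requests as writes through the whole Raft consensus, with similar speed.	`false`
`raft_logs_level`	Text logging level for coordination (trace, debug, and so on).	`system default`
`auto_forwarding`	Allow forwarding write requests from followers to the leader.	`true`
`shutdown_timeout`	Wait to finish internal connections and shut down (ms).	`5000`
`startup_timeout`	If the server does not connect to other quorum participants within the specified timeout, it terminates (ms).	`30000`
`async_replication`	Enable async replication. All write and read guarantees are preserved while better performance is achieved. The setting is disabled by default so as not to break backwards compatibility.	`false`
`latest_logs_cache_size_threshold`	Maximum total size of the in-memory cache of the latest log entries.	`1GiB`
`commit_logs_cache_size_threshold`	Maximum total size of the in-memory cache of log entries needed next for commit.	`500MiB`
`disk_move_retries_wait_ms`	How long to wait between retries after a failure that happened while a file was being moved between disks.	`1000`
`disk_move_retries_during_init`	Number of retries after a failure that happened while a file was being moved between disks during initialization.	`100`
`experimental_use_rocksdb`	Use RocksDB as backend storage.	`0`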

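Quorum configuration is located in the <keeper_server>.<raft_configuration> section and contains server descriptions.

The only parameter for the whole quorum is secure, which enables encrypted connections for communication between quorum participants. Set it to true if SSL is required for internal communication between nodes; otherwise leave it unspecified.

The main parameters for each <server> are:

id: Server identifier in the quorum.
hostname: Hostname where this server is placed.
port: Port where this server listens for connections.
can_become_leader: Set to false to set up the server as a learner. If omitted, the value is true.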

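Note

If the topology of your ClickHouse Keeper cluster changes (for example, when replacing a server), make sure to keep the mapping of server_id to hostname consistent and avoid shuffling or reusing an existing server_id for different servers (this can happen, for example, if you rely on automation scripts to deploy ClickHouse Keeper).

If the host of a Keeper instance can change, we recommend defining and using a hostname instead of a raw IP address. Changing a hostname is equivalent to removing the server and adding it back, which in some cases is impossible (for example, when there are not enough Keeper instances for a quorum).

Note

async_replication is disabled by default to avoid breaking backwards compatibility. If all Keeper instances in your cluster run a version that supports async_replication (v23.9+), we recommend enabling it because it can improve performance without any downsides.

Examples of configuration for a quorum with three nodes can be found in the integration tests with the test_keeper_ prefix. Example configuration for server #1: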

<keeper_server>
    <tcp_port>2181</tcp_port>
    <server_id>1</server_id>
    <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>30000</session_timeout_ms>
        <raft_logs_level>trace</raft_logs_level>
    </coordination_settings>

    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>zoo1</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>2</id>
            <hostname>zoo2</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>3</id>
            <hostname>zoo3</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>

How to run

ClickHouse Keeper is bundled into the ClickHouse server package. Add the <keeper_server> configuration to your /etc/your_path_to_config/clickhouse-server/config.xml and start the ClickHouse server as usual. To run ClickHouse Keeper standalone, start it like this:

clickhouse-keeper --config /etc/your_path_to_config/config.xml

If you don't have the symlink (clickhouse-keeper), you can create it or pass keeper as an argument to clickhouse:

clickhouse keeper --config /etc/your_path_to_config/config.xml

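Four-letter-word commands

ClickHouse Keeper also provides 4lw commands, which are almost the same as ZooKeeper's. Each command is composed of four letters, such as mntr or stat. Some of the more interesting ones: stat gives general information about the server and connected clients, while srvr and cons give extended details on the server and connections respectively.

The 4lw commands have a white list configuration, four_letter_word_white_list, whose default value is conf,cons,crst,envi,ruok,srst,srvr,stat,wchs,dirs,mntr,isro,rcvr,apiv,csnp,lgif,rqld,ydld.

You can issue the commands to ClickHouse Keeper via telnet or nc at the client port.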

echo mntr | nc localhost 9181
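The same exchange can be done from code without nc. The sketch below opens a TCP connection to the client port, writes the four letters, and reads until the server closes the connection. It assumes the server closes the connection after replying, as with the nc example; the function name is hypothetical.

```python
import socket


def four_letter_word(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send a 4lw command to a Keeper client port and return the raw text reply."""
    if len(command) != 4:
        raise ValueError("4lw commands are exactly four letters long")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # assumed: the server closes the connection after replying
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Usage against the port from the nc example (requires a running Keeper):
#   print(four_letter_word("localhost", 9181, "mntr"))
```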

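The detailed 4lw commands are listed below.

ruok: Tests whether the server is running in a non-error state. The server responds with imok if it is running; otherwise it does not respond at all. A response of imok does not necessarily indicate that the server has joined the quorum, only that the server process is active and bound to the specified client port. Use stat for details on the state with respect to quorum and client connection information.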

imok

mntr: Outputs a list of variables that can be used to monitor the health of the cluster.

zk_version      v21.11.1.1-prestable-7a4a0b0edef0ad6e0aa662cd3b90c3f4acf796e7
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received     68
zk_packets_sent 68
zk_num_alive_connections        1
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count  4
zk_watch_count  1
zk_ephemerals_count     0
zk_approximate_data_size        723
zk_open_file_descriptor_count   310
zk_max_file_descriptor_count    10240
zk_followers    0
zk_synced_followers     0
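For monitoring scripts, the mntr output is easy to turn into a dictionary, since each line is a key followed by whitespace and a value. A minimal sketch follows; the function name is hypothetical, and the sample text is a shortened copy of the output above.

```python
def parse_mntr(output: str) -> dict:
    """Parse mntr output (one "key value" pair per line) into a dict.

    Purely numeric values are converted to int; everything else stays a string.
    """
    metrics = {}
    for line in output.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            key, value = parts[0], parts[1].strip()
            metrics[key] = int(value) if value.isdigit() else value
    return metrics


SAMPLE = """zk_server_state leader
zk_znode_count  4
zk_followers    0
zk_synced_followers     0"""

metrics = parse_mntr(SAMPLE)
print(metrics["zk_server_state"], metrics["zk_znode_count"])  # prints: leader 4
```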

srvr: Lists full details for the server.

ClickHouse Keeper version: v21.11.1.1-prestable-7a4a0b0edef0ad6e0aa662cd3b90c3f4acf796e7
Latency min/avg/max: 0/0/0
Received: 2
Sent : 2
Connections: 1
Outstanding: 0
Zxid: 34
Mode: leader
Node count: 4

stat: Lists brief details for the server and connected clients.

ClickHouse Keeper version: v21.11.1.1-prestable-7a4a0b0edef0ad6e0aa662cd3b90c3f4acf796e7
Clients:
 192.168.1.1:52852(recved=0,sent=0)
 192.168.1.1:52042(recved=24,sent=48)
Latency min/avg/max: 0/0/0
Received: 4
Sent : 4
Connections: 1
Outstanding: 0
Zxid: 36
Mode: leader
Node count: 4

srst: Resets server statistics. The command affects the results of srvr, mntr, and stat.

Server stats reset.

conf: Prints details about the serving configuration.

server_id=1
tcp_port=2181
four_letter_word_white_list=*
log_storage_path=./coordination/logs
snapshot_storage_path=./coordination/snapshots
max_requests_batch_size=100
session_timeout_ms=30000
operation_timeout_ms=10000
dead_session_check_period_ms=500
heart_beat_interval_ms=500
election_timeout_lower_bound_ms=1000
election_timeout_upper_bound_ms=2000
reserved_log_items=1000000000000000
snapshot_distance=10000
auto_forwarding=true
shutdown_timeout=5000
startup_timeout=240000
raft_logs_level=information
snapshots_to_keep=3
rotate_log_storage_interval=100000
stale_log_gap=10000
fresh_log_gap=200
max_requests_batch_size=100
quorum_reads=false
force_sync=false
compress_logs=true
compress_snapshots_with_zstd_format=true
configuration_change_tries_count=20

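cons: Lists full connection/session details for all clients connected to this server, including the number of packets received and sent, session ID, operation latencies, the last operation performed, and so on.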

 192.168.1.1:52163(recved=0,sent=0,sid=0xffffffffffffffff,lop=NA,est=1636454787393,to=30000,lzxid=0xffffffffffffffff,lresp=0,llat=0,minlat=0,avglat=0,maxlat=0)
 192.168.1.1:52042(recved=9,sent=18,sid=0x0000000000000001,lop=List,est=1636454739887,to=30000,lcxid=0x0000000000000005,lzxid=0x0000000000000005,lresp=1636454739892,llat=0,minlat=0,avglat=0,maxlat=0)

crst: Resets connection/session statistics for all connections.

Connection stats reset.

envi: Prints details about the serving environment.

Environment:
clickhouse.keeper.version=v21.11.1.1-prestable-7a4a0b0edef0ad6e0aa662cd3b90c3f4acf796e7
host.name=ZBMAC-C02D4054M.local
os.name=Darwin
os.arch=x86_64
os.version=19.6.0
cpu.count=12
user.name=root
user.home=/Users/JackyWoo/
user.dir=/Users/JackyWoo/project/jd/clickhouse/cmake-build-debug/programs/
user.tmp=/var/folders/b4/smbq5mfj7578f2jzwn602tt40000gn/T/

dirs: Shows the total size of snapshot and log files in bytes.

snapshot_dir_size: 0
log_dir_size: 3875

isro: Tests whether the server is running in read-only mode. The server responds with ro if it is in read-only mode or rw if it is not.

rw

wchs: Lists brief information on watches for the server.

1 connections watching 1 paths
Total watches:1

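wchc: Lists detailed information on watches for the server, by session. This outputs a list of sessions (connections) with associated watches (paths). Depending on the number of watches, this operation may be expensive (it can impact server performance), so use it carefully.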

0x0000000000000001
    /clickhouse/task_queue/ddl

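wchp: Lists detailed information on watches for the server, by path. This outputs a list of paths (znodes) with associated sessions. Depending on the number of watches, this operation may be expensive (it can impact server performance), so use it carefully.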

/clickhouse/task_queue/ddl
    0x0000000000000001

dump: Lists the outstanding sessions and ephemeral nodes. This works only on the leader.

Sessions dump (2):
0x0000000000000001
0x0000000000000002
Sessions with Ephemerals (1):
0x0000000000000001
 /clickhouse/task_queue/ddl

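csnp: Schedules a snapshot creation task. Returns the last committed log index of the scheduled snapshot on success, or Failed to schedule snapshot creation task. on failure. The lgif command can help you determine whether the snapshot is done.

lgif: Keeper log information. first_log_idx: my first log index in the log store; first_log_term: my first log term; last_log_idx: my last log index in the log store; last_log_term: my last log term; last_committed_log_idx: my last committed log index in the state machine; leader_committed_log_idx: the leader's committed log index from my perspective; target_committed_log_idx: the target log index that should be committed; last_snapshot_idx: the largest committed log index in the last snapshot.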

first_log_idx   1
first_log_term  1
last_log_idx    101
last_log_term   1
last_committed_log_idx  100
leader_committed_log_idx    101
target_committed_log_idx    101
last_snapshot_idx   50
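One way to use lgif to check on a snapshot scheduled with csnp is to compare last_snapshot_idx with the index that csnp returned. The sketch below parses the lgif output and applies that comparison. Treating "last_snapshot_idx has reached the csnp index" as "snapshot done" is an assumption drawn from the field descriptions above, and both function names are hypothetical.

```python
def parse_lgif(output: str) -> dict:
    """Parse lgif output (one "name value" pair per line) into a dict of ints."""
    info = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            info[parts[0]] = int(parts[1])
    return info


def snapshot_done(lgif_output: str, scheduled_idx: int) -> bool:
    """Assumed check: the snapshot is done once last_snapshot_idx >= the csnp index."""
    return parse_lgif(lgif_output).get("last_snapshot_idx", 0) >= scheduled_idx


SAMPLE = """last_committed_log_idx  100
leader_committed_log_idx    101
last_snapshot_idx   50"""

print(snapshot_done(SAMPLE, 50))   # prints: True
print(snapshot_done(SAMPLE, 100))  # prints: False
```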

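rqld: Requests to become the new leader. Returns Sent leadership request to leader. if the request was sent, or Failed to send leadership request to leader. if it was not. If the node is already the leader, the outcome is the same as when the request is sent.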

Sent leadership request to leader.

ftfl: Lists all feature flags and whether they are enabled for the Keeper instance.

filtered_list   1
multi_read  1
check_not_exists    0

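ydld: Requests to yield leadership and become a follower. If the server receiving the request is the leader, it first pauses write operations, waits for the successor (the current leader can never be the successor) to finish catching up on the latest log, and then resigns. The successor is chosen automatically. Returns Sent yield leadership request to leader. if the request was sent, or Failed to send yield leadership request to leader. if it was not. If the node is already a follower, the outcome is the same as when the request is sent.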

Sent yield leadership request to leader.

pfev: Returns the values for all collected events. For each event it returns the event name, event value, and event description.

FileOpen	62	Number of files opened.
Seek	4	Number of times the 'lseek' function was called.
ReadBufferFromFileDescriptorRead	126	Number of reads (read/pread) from a file descriptor. Does not include sockets.
ReadBufferFromFileDescriptorReadFailed	0	Number of times the read (read/pread) from a file descriptor have failed.
ReadBufferFromFileDescriptorReadBytes	178846	Number of bytes read from file descriptors. If the file is compressed, this will show the compressed data size.
WriteBufferFromFileDescriptorWrite	7	Number of writes (write/pwrite) to a file descriptor. Does not include sockets.
WriteBufferFromFileDescriptorWriteFailed	0	Number of times the write (write/pwrite) to a file descriptor have failed.
WriteBufferFromFileDescriptorWriteBytes	153	Number of bytes written to file descriptors. If the file is compressed, this will show compressed data size.
FileSync	2	Number of times the F_FULLFSYNC/fsync/fdatasync function was called for files.
DirectorySync	0	Number of times the F_FULLFSYNC/fsync/fdatasync function was called for directories.
FileSyncElapsedMicroseconds	12756	Total time spent waiting for F_FULLFSYNC/fsync/fdatasync syscall for files.
DirectorySyncElapsedMicroseconds	0	Total time spent waiting for F_FULLFSYNC/fsync/fdatasync syscall for directories.
ReadCompressedBytes	0	Number of bytes (the number of bytes before decompression) read from compressed sources (files, network).
CompressedReadBufferBlocks	0	Number of compressed blocks (the blocks of data that are compressed independent of each other) read from compressed sources (files, network).
CompressedReadBufferBytes	0	Number of uncompressed bytes (the number of bytes after decompression) read from compressed sources (files, network).
AIOWrite	0	Number of writes with Linux or FreeBSD AIO interface
AIOWriteBytes	0	Number of bytes written with Linux or FreeBSD AIO interface
...

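HTTP control

ClickHouse Keeper provides an HTTP interface to check whether a replica is ready to receive traffic. It can be used in cloud environments such as Kubernetes.

Example of a configuration that enables the /ready endpoint: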

<clickhouse>
    <keeper_server>
        <http_control>
            <port>9182</port>
            <readiness>
                <endpoint>/ready</endpoint>
            </readiness>
        </http_control>
    </keeper_server>
</clickhouse>
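A readiness probe against this endpoint can be sketched as a plain HTTP GET. The snippet below treats an HTTP 200 response as "ready" and anything else (including connection errors) as "not ready"; that status-code convention is an assumption, since the text above does not spell out the response format. The function name is hypothetical.

```python
import urllib.error
import urllib.request


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200 (assumed convention)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses, refused connections, and timeouts all count as not ready.
        return False


# Usage with the port and path from the configuration above (requires a running Keeper):
#   print(is_ready("http://localhost:9182/ready"))
```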

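Feature flags

Keeper is fully compatible with ZooKeeper and its clients, but it also introduces some unique features and request types that can be used by the ClickHouse client. Because these features can introduce backward-incompatible changes, most of them are disabled by default and can be enabled using the keeper_server.feature_flags config.
All features can be disabled explicitly.
If you want to enable a new feature for your Keeper cluster, we recommend first updating all Keeper instances in the cluster to a version that supports the feature and then enabling the feature itself.

Example of a feature flag config that disables multi_read and enables check_not_exists: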

<clickhouse>
    <keeper_server>
        <feature_flags>
            <multi_read>0</multi_read>
            <check_not_exists>1</check_not_exists>
        </feature_flags>
    </keeper_server>
</clickhouse>

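The following features are available:

multi_read - support for the read multi request. Default: 1
filtered_list - support for the list request that filters results by node type (ephemeral or persistent). Default: 1
check_not_exists - support for the CheckNotExists request, which asserts that a node does not exist. Default: 0
create_if_not_exists - support for the CreateIfNotExists request, which tries to create a node if it does not exist. If it exists, no changes are applied and ZOK is returned. Default: 0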

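Migration from ZooKeeper

Seamless migration from ZooKeeper to ClickHouse Keeper is not possible. You have to stop your ZooKeeper cluster, convert the data, and start ClickHouse Keeper. The clickhouse-keeper-converter tool converts ZooKeeper logs and snapshots into a ClickHouse Keeper snapshot. It works only with ZooKeeper > 3.4. Steps for migration:

Stop all ZooKeeper nodes.
Optional, but recommended: find the ZooKeeper leader node, then start and stop it again. This forces ZooKeeper to create a consistent snapshot.
Run clickhouse-keeper-converter on the leader, for example: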

clickhouse-keeper-converter --zookeeper-logs-dir /var/lib/zookeeper/version-2 --zookeeper-snapshots-dir /var/lib/zookeeper/version-2 --output-dir /path/to/clickhouse/keeper/snapshots

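Copy the snapshot to ClickHouse server nodes with a configured keeper, or start ClickHouse Keeper instead of ZooKeeper. The snapshot must persist on all nodes; otherwise, empty nodes can start faster and one of them can become the leader.

Note

The keeper-converter tool is not available from the Keeper standalone binary.
If you have ClickHouse installed, you can use the binary directly: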

clickhouse keeper-converter ...

Otherwise, you can download the binary and run the tool as described above without installing ClickHouse.

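Recovering after losing quorum

Because ClickHouse Keeper uses Raft, it can tolerate a certain number of node crashes, depending on the cluster size.
For example, a 3-node cluster continues working correctly if only 1 node crashes.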
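The tolerance follows from Raft's majority rule: a cluster of n voting nodes needs more than half of them to agree, so it survives the loss of the rest. A short worked calculation (function names are illustrative only):

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a cluster with the given number of voting nodes."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still available."""
    return nodes - quorum_size(nodes)


for n in (1, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# prints:
# 1 1 0
# 3 2 1
# 5 3 2
```

This is also why an even-sized cluster tolerates no more failures than the next smaller odd-sized one: 4 nodes still tolerate only 1 failure.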

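Cluster configuration can be changed dynamically, but there are some limitations. Reconfiguration also relies on Raft, so to add or remove a node from the cluster you need to have a quorum. If you lose too many nodes in your cluster at the same time without any chance of starting them again, Raft stops working and does not allow you to reconfigure the cluster in the conventional way.

Nevertheless, ClickHouse Keeper has a recovery mode that lets you forcefully reconfigure the cluster with only 1 node. This should be done only as a last resort, when you cannot start the nodes again or start a new instance on the same endpoint.

Important things to note before continuing:

Make sure that the failed nodes cannot connect to the cluster again.
Do not start any of the new nodes until it is specified in the steps.

After making sure that the above things are true, you need to do the following:

1. Pick a single Keeper node to be your new leader. Be aware that the data of that node will be used for the entire cluster, so we recommend using the node with the most up-to-date state.
2. Before doing anything else, make a backup of the log_storage_path and snapshot_storage_path folders of the picked node.
3. Reconfigure the cluster on all of the nodes you want to use.
4. Send the four-letter command rcvr to the node you picked, which moves the node to recovery mode, or stop the Keeper instance on the picked node and start it again with the --force-recovery argument.
5. One by one, start the Keeper instances on the new nodes, making sure that mntr returns follower for zk_server_state before starting the next one.

While in recovery mode, the leader node returns an error message for the mntr command until it achieves quorum with the new nodes, and it refuses any requests from clients and followers.
After quorum is achieved, the leader node returns to the normal mode of operation and accepts all requests, verifying them with Raft; mntr should then return leader for zk_server_state.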

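Using disks with Keeper

Keeper supports a subset of external disks for storing snapshots, log files, and the state file.

Supported types of disks are:

s3_plain
s3
local

The following is an example of disk definitions contained inside a config.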

<clickhouse>
    <storage_configuration>
        <disks>
            <log_local>
                <type>local</type>
                <path>/var/lib/clickhouse/coordination/logs/</path>
            </log_local>
            <log_s3_plain>
                <type>s3_plain</type>
                <endpoint>https://some_s3_endpoint/logs/</endpoint>
                <access_key_id>ACCESS_KEY</access_key_id>
                <secret_access_key>SECRET_KEY</secret_access_key>
            </log_s3_plain>
            <snapshot_local>
                <type>local</type>
                <path>/var/lib/clickhouse/coordination/snapshots/</path>
            </snapshot_local>
            <snapshot_s3_plain>
                <type>s3_plain</type>
                <endpoint>https://some_s3_endpoint/snapshots/</endpoint>
                <access_key_id>ACCESS_KEY</access_key_id>
                <secret_access_key>SECRET_KEY</secret_access_key>
            </snapshot_s3_plain>
            <state_s3_plain>
                <type>s3_plain</type>
                <endpoint>https://some_s3_endpoint/state/</endpoint>
                <access_key_id>ACCESS_KEY</access_key_id>
                <secret_access_key>SECRET_KEY</secret_access_key>
            </state_s3_plain>
        </disks>
    </storage_configuration>
</clickhouse>

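To use a disk for logs, set the keeper_server.log_storage_disk config to the name of the disk.
To use a disk for snapshots, set the keeper_server.snapshot_storage_disk config to the name of the disk.
Additionally, different disks can be used for the latest logs or snapshots by using keeper_server.latest_log_storage_disk and keeper_server.latest_snapshot_storage_disk respectively.
In that case, Keeper automatically moves files to the correct disks when new logs or snapshots are created. To use a disk for the state file, set the keeper_server.state_storage_disk config to the name of the disk.

Moving files between disks is safe, and there is no risk of losing data if Keeper stops in the middle of a transfer. Until a file is completely moved to the new disk, it is not deleted from the old one.

With keeper_server.coordination_settings.force_sync set to true (true by default), Keeper cannot satisfy some guarantees for all types of disks.
Right now, only disks of type local support persistent sync.
If force_sync is used, log_storage_disk should be a local disk if latest_log_storage_disk is not used.
If latest_log_storage_disk is used, it should always be a local disk.
If force_sync is disabled, disks of all types can be used in any setup.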
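The force_sync rules above reduce to a small decision function. The sketch below is a hypothetical pre-flight check over disk types, not something Keeper exposes; it encodes only the rules stated in the preceding paragraph.

```python
from typing import Optional


def log_disks_ok(force_sync: bool, log_disk_type: str,
                 latest_log_disk_type: Optional[str] = None) -> bool:
    """Check a log disk setup against the force_sync rules described above.

    latest_log_disk_type is None when latest_log_storage_disk is not configured.
    """
    if not force_sync:
        return True  # without force_sync, any disk type can be used
    if latest_log_disk_type is not None:
        return latest_log_disk_type == "local"  # the latest-log disk must be local
    return log_disk_type == "local"  # otherwise the log disk itself must be local


# s3_plain for older logs plus a local disk for the latest logs is acceptable:
print(log_disks_ok(True, "s3_plain", "local"))  # prints: True
# s3_plain alone with force_sync enabled is not:
print(log_disks_ok(True, "s3_plain"))           # prints: False
```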

A possible storage setup for a Keeper instance could look like this:

<clickhouse>
    <keeper_server>
        <log_storage_disk>log_s3_plain</log_storage_disk>
        <latest_log_storage_disk>log_local</latest_log_storage_disk>

        <snapshot_storage_disk>snapshot_s3_plain</snapshot_storage_disk>
        <latest_snapshot_storage_disk>snapshot_local</latest_snapshot_storage_disk>
    </keeper_server>
</clickhouse>

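This instance stores all but the latest logs on the disk log_s3_plain, while the latest log is stored on the disk log_local.
The same logic applies to snapshots: all but the latest snapshots are stored on snapshot_s3_plain, while the latest snapshot is stored on the disk snapshot_local.

Changing disk setup

Info

Before applying a new disk setup, manually back up all Keeper logs and snapshots.

If a tiered disk setup is defined (separate disks for the latest files), Keeper tries to automatically move files to the correct disks on startup.
The same guarantee as before applies: until a file is completely moved to the new disk, it is not deleted from the old one, so multiple restarts can be done safely.

If it is necessary to move files to a completely new disk (or to move from a 2-disk setup to a single-disk setup), multiple definitions of keeper_server.old_snapshot_storage_disk and keeper_server.old_log_storage_disk can be used.

The following config shows how to move from the previous 2-disk setup to a completely new single-disk setup: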

<clickhouse>
    <keeper_server>
        <old_log_storage_disk>log_local</old_log_storage_disk>
        <old_log_storage_disk>log_s3_plain</old_log_storage_disk>
        <log_storage_disk>log_local2</log_storage_disk>

        <old_snapshot_storage_disk>snapshot_s3_plain</old_snapshot_storage_disk>
        <old_snapshot_storage_disk>snapshot_local</old_snapshot_storage_disk>
        <snapshot_storage_disk>snapshot_local2</snapshot_storage_disk>
    </keeper_server>
</clickhouse>

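On startup, all log files are moved from log_local and log_s3_plain to the log_local2 disk.
Likewise, all snapshot files are moved from snapshot_local and snapshot_s3_plain to the snapshot_local2 disk.

Configuring logs cache

To minimize the amount of data read from disk, Keeper caches log entries in memory.
If requests are large, log entries take up too much memory, so the amount of cached logs is capped.
The limit is controlled by these two configs:

latest_logs_cache_size_threshold - total size of the latest logs stored in the cache
commit_logs_cache_size_threshold - total size of the subsequent logs that need to be committed next

If the default values are too big, you can reduce memory usage by lowering these two configs.

Note

You can use the pfev command to check the amount of logs read from each cache and from a file.
You can also use metrics from the Prometheus endpoint to track the current size of both caches.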

Prometheus

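Keeper can expose metrics data for scraping from Prometheus.

Settings:

endpoint: HTTP endpoint for scraping metrics by the Prometheus server. Starts with '/'.
port: Port for endpoint.
metrics: Flag that enables exposing metrics from the system.metrics table.
events: Flag that enables exposing metrics from the system.events table.
asynchronous_metrics: Flag that enables exposing current metric values from the system.asynchronous_metrics table.

Example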

<clickhouse>
    <listen_host>0.0.0.0</listen_host>
    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>

Check (replace 127.0.0.1 with the IP address or hostname of your ClickHouse server):

curl 127.0.0.1:9363/metrics

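See also the ClickHouse Cloud Prometheus integration.

ClickHouse Keeper user guide

This guide provides simple and minimal settings to configure ClickHouse Keeper, with an example of how to test distributed operations. The example is performed using 3 nodes on Linux.

1. Configure nodes with Keeper settings

Install 3 ClickHouse instances on 3 hosts (chnode1, chnode2, chnode3). (See the Quick Start for details on installing ClickHouse.)
On each node, add the following entry to allow external communication through the network interface.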
```
<listen_host>0.0.0.0</listen_host>
```

Add the following ClickHouse Keeper configuration to all three servers, updating the <server_id> setting for each server: 1 for chnode1, 2 for chnode2, and so on.

<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>30000</session_timeout_ms>
        <raft_logs_level>warning</raft_logs_level>
    </coordination_settings>

    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>chnode1.domain.com</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>2</id>
            <hostname>chnode2.domain.com</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>3</id>
            <hostname>chnode3.domain.com</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>

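These are the basic settings used above:

Parameter	Description	Example
tcp_port	port to be used by Keeper clients	9181 (the equivalent of the default 2181 in ZooKeeper)
server_id	identifier for each ClickHouse Keeper server, used in the Raft configuration	1
coordination_settings	section for parameters such as timeouts	timeouts: 10000, log level: warning
server	definition of a participating server	list of each server definition
raft_configuration	settings for each server in the Keeper cluster	server and settings for each
id	numeric ID of the server for Keeper services	1
hostname	hostname, IP, or FQDN of each server in the Keeper cluster	`chnode1.domain.com`
port	port to listen on for interserver Keeper connections	9234

Enable the ZooKeeper component. It will use the ClickHouse Keeper engine: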

    <zookeeper>
        <node>
            <host>chnode1.domain.com</host>
            <port>9181</port>
        </node>
        <node>
            <host>chnode2.domain.com</host>
            <port>9181</port>
        </node>
        <node>
            <host>chnode3.domain.com</host>
            <port>9181</port>
        </node>
    </zookeeper>

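These are the basic settings used above:

Parameter	Description	Example
node	list of nodes for ClickHouse Keeper connections	settings entry for each server
host	hostname, IP, or FQDN of each ClickHouse Keeper node	`chnode1.domain.com`
port	ClickHouse Keeper client port	9181

Restart ClickHouse and verify that each Keeper instance is running. Execute the following command on each server. The ruok command returns imok if Keeper is running and healthy: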
```
# echo ruok | nc localhost 9181; echo
imok
```

The system database has a table named zookeeper that contains the details of your ClickHouse Keeper instances. Let's view the table:

SELECT *
FROM system.zookeeper
WHERE path IN ('/', '/clickhouse')

The table looks like this:

┌─name───────┬─value─┬─czxid─┬─mzxid─┬───────────────ctime─┬───────────────mtime─┬─version─┬─cversion─┬─aversion─┬─ephemeralOwner─┬─dataLength─┬─numChildren─┬─pzxid─┬─path────────┐
│ clickhouse │       │   124 │   124 │ 2022-03-07 00:49:34 │ 2022-03-07 00:49:34 │       0 │        2 │        0 │              0 │          0 │           2 │  5693 │ /           │
│ task_queue │       │   125 │   125 │ 2022-03-07 00:49:34 │ 2022-03-07 00:49:34 │       0 │        1 │        0 │              0 │          0 │           1 │   126 │ /clickhouse │
│ tables     │       │  5693 │  5693 │ 2022-03-07 00:49:34 │ 2022-03-07 00:49:34 │       0 │        3 │        0 │              0 │          0 │           3 │  6461 │ /clickhouse │
└────────────┴───────┴───────┴───────┴─────────────────────┴─────────────────────┴─────────┴──────────┴──────────┴────────────────┴────────────┴─────────────┴───────┴─────────────┘

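2. Configure a cluster in ClickHouse

Let's configure a simple cluster with 2 shards and a single replica of each, placed on 2 of the nodes. The third node is used only to achieve the quorum that ClickHouse Keeper requires. Update the configuration on chnode1 and chnode2. The following cluster defines 1 shard on each node, for a total of 2 shards with no replication. In this example, some of the data will be on one node and some will be on the other: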

    <remote_servers>
        <cluster_2S_1R>
            <shard>
                <replica>
                    <host>chnode1.domain.com</host>
                    <port>9000</port>
                    <user>default</user>
                    <password>ClickHouse123!</password>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>chnode2.domain.com</host>
                    <port>9000</port>
                    <user>default</user>
                    <password>ClickHouse123!</password>
                </replica>
            </shard>
        </cluster_2S_1R>
    </remote_servers>

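Parameter	Description	Example
shard	list of replicas in the cluster definition	list of replicas for each shard
replica	list of settings for each replica	settings entries for each replica
host	hostname, IP, or FQDN of the server that will host a replica shard	`chnode1.domain.com`
port	port used to communicate using the native TCP protocol	9000
user	username that will be used to authenticate to the cluster instances	default
password	password for the defined user, used to allow connections to cluster instances	`ClickHouse123!`

Restart ClickHouse and verify that the cluster was created: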

SHOW clusters;

You should see your cluster:

┌─cluster───────┐
│ cluster_2S_1R │
└───────────────┘

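3. Create and test a distributed table

Create a new database on the new cluster using the ClickHouse client on chnode1. The ON CLUSTER clause automatically creates the database on both nodes.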
```
CREATE DATABASE db1 ON CLUSTER 'cluster_2S_1R';
```

Create a new table in the db1 database. Once again, ON CLUSTER creates the table on both nodes.

CREATE TABLE db1.table1 on cluster 'cluster_2S_1R'
(
    `id` UInt64,
    `column1` String
)
ENGINE = MergeTree
ORDER BY column1

On the chnode1 node, add a couple of rows:

INSERT INTO db1.table1
    (id, column1)
VALUES
    (1, 'abc'),
    (2, 'def')

On the chnode2 node, add a couple of rows:

INSERT INTO db1.table1
    (id, column1)
VALUES
    (3, 'ghi'),
    (4, 'jkl')

Notice that running a SELECT statement on each node shows only the data on that node. For example, on chnode1:

SELECT *
FROM db1.table1

Query id: 7ef1edbc-df25-462b-a9d4-3fe6f9cb0b6d

┌─id─┬─column1─┐
│  1 │ abc     │
│  2 │ def     │
└────┴─────────┘

2 rows in set. Elapsed: 0.006 sec.

On chnode2:

SELECT *
FROM db1.table1

Query id: c43763cc-c69c-4bcc-afbe-50e764adfcbf

┌─id─┬─column1─┐
│  3 │ ghi     │
│  4 │ jkl     │
└────┴─────────┘

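You can create a Distributed table to represent the data on the two shards. Tables with the Distributed table engine do not store any data of their own, but allow distributed query processing on multiple servers. Reads hit all the shards, and writes can be distributed across the shards. Run the following query on chnode1: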
```
CREATE TABLE db1.dist_table (
    id UInt64,
    column1 String
)
ENGINE = Distributed(cluster_2S_1R,db1,table1)
```

Notice that querying dist_table returns all four rows of data from the two shards:

SELECT *
FROM db1.dist_table

Query id: 495bffa0-f849-4a0c-aeea-d7115a54747a

┌─id─┬─column1─┐
│  1 │ abc     │
│  2 │ def     │
└────┴─────────┘
┌─id─┬─column1─┐
│  3 │ ghi     │
│  4 │ jkl     │
└────┴─────────┘

4 rows in set. Elapsed: 0.018 sec.

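Summary

This guide demonstrated how to set up a cluster using ClickHouse Keeper. With ClickHouse Keeper, you can configure clusters and define distributed tables that can be replicated across shards.

Configuring ClickHouse Keeper with unique paths

Not supported in ClickHouse Cloud

Note

This page is not applicable to ClickHouse Cloud. The procedure documented here is automated in ClickHouse Cloud services.

Description

This article describes how to use the built-in {uuid} macro setting to create unique entries in ClickHouse Keeper or ZooKeeper. Unique paths help when tables are created and dropped frequently, because they avoid having to wait several minutes for Keeper garbage collection to remove path entries: each time a path is created, a new uuid is used in that path, so paths are never reused.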
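To see why a {uuid} component keeps paths from colliding, the sketch below mimics the macro substitution in plain Python. This is only an illustration: ClickHouse performs the real expansion on the server and uses the UUID of the table, whereas this sketch simply generates a random UUID on each call. The function name, the template variable, and the shard value are assumptions for the example.

```python
import uuid


def expand_path(template: str, macros: dict) -> str:
    """Fill {name} placeholders, generating a fresh value for {uuid} on each call."""
    values = dict(macros)
    values["uuid"] = str(uuid.uuid4())  # stand-in for the UUID of a newly created table
    return template.format(**values)


TEMPLATE = "/clickhouse/tables/{shard}/db_uuid/{uuid}"

first = expand_path(TEMPLATE, {"shard": "1"})
second = expand_path(TEMPLATE, {"shard": "1"})
print(first)
print(first != second)  # prints: True
```

Because every creation yields a different final path component, a dropped table's old path never blocks the creation of its replacement.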

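Example environment

A three-node cluster that will be configured to have ClickHouse Keeper on all three nodes and ClickHouse on two of the nodes. This provides ClickHouse Keeper with three nodes (including a tiebreaker node) and a single ClickHouse shard made up of two replicas.

node	description
`chnode1.marsnet.local`	data node - cluster `cluster_1S_2R`
`chnode2.marsnet.local`	data node - cluster `cluster_1S_2R`
`chnode3.marsnet.local`	ClickHouse Keeper tiebreaker node

Example config for the cluster: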

    <remote_servers>
        <cluster_1S_2R>
            <shard>
                <replica>
                    <host>chnode1.marsnet.local</host>
                    <port>9440</port>
                    <user>default</user>
                    <password>ClickHouse123!</password>
                    <secure>1</secure>
                </replica>
                <replica>
                    <host>chnode2.marsnet.local</host>
                    <port>9440</port>
                    <user>default</user>
                    <password>ClickHouse123!</password>
                    <secure>1</secure>
                </replica>
            </shard>
        </cluster_1S_2R>
    </remote_servers>

Procedures to set up tables to use `{uuid}`

Configure macros on each server. Example for server 1:

    <macros>
        <shard>1</shard>
        <replica>replica_1</replica>
    </macros>

Note

Notice that we define macros for shard and replica, but {uuid} is not defined here; it is built in and there is no need to define it.

Create a database:

CREATE DATABASE db_uuid
      ON CLUSTER 'cluster_1S_2R'
      ENGINE Atomic;

CREATE DATABASE db_uuid ON CLUSTER cluster_1S_2R
ENGINE = Atomic

Query id: 07fb7e65-beb4-4c30-b3ef-bd303e5c42b5

┌─host──────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ chnode2.marsnet.local │ 9440 │      0 │       │                   1 │                0 │
│ chnode1.marsnet.local │ 9440 │      0 │       │                   0 │                0 │
└───────────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

Create a table on the cluster using the macros and {uuid}:

CREATE TABLE db_uuid.uuid_table1 ON CLUSTER 'cluster_1S_2R'
   (
     id UInt64,
     column1 String
   )
   ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/db_uuid/{uuid}', '{replica}' )
   ORDER BY (id);

CREATE TABLE db_uuid.uuid_table1 ON CLUSTER cluster_1S_2R
(
    `id` UInt64,
    `column1` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/db_uuid/{uuid}', '{replica}')
ORDER BY id

Query id: 8f542664-4548-4a02-bd2a-6f2c973d0dc4

┌─host──────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ chnode1.marsnet.local │ 9440 │      0 │       │                   1 │                0 │
│ chnode2.marsnet.local │ 9440 │      0 │       │                   0 │                0 │
└───────────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

Create a distributed table:

create table db_uuid.dist_uuid_table1 on cluster 'cluster_1S_2R'
   (
     id UInt64,
     column1 String
   )
   ENGINE = Distributed('cluster_1S_2R', 'db_uuid', 'uuid_table1' );

CREATE TABLE db_uuid.dist_uuid_table1 ON CLUSTER cluster_1S_2R
(
    `id` UInt64,
    `column1` String
)
ENGINE = Distributed('cluster_1S_2R', 'db_uuid', 'uuid_table1')

Query id: 3bc7f339-ab74-4c7d-a752-1ffe54219c0e

┌─host──────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ chnode2.marsnet.local │ 9440 │      0 │       │                   1 │                0 │
│ chnode1.marsnet.local │ 9440 │      0 │       │                   0 │                0 │
└───────────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

Testing

Insert data into the first node (for example, chnode1):

INSERT INTO db_uuid.uuid_table1
   ( id, column1)
   VALUES
   ( 1, 'abc');

INSERT INTO db_uuid.uuid_table1 (id, column1) FORMAT Values

Query id: 0f178db7-50a6-48e2-9a1b-52ed14e6e0f9

Ok.

1 row in set. Elapsed: 0.033 sec.

Insert data into the second node (for example, chnode2):

INSERT INTO db_uuid.uuid_table1
   ( id, column1)
   VALUES
   ( 2, 'def');

INSERT INTO db_uuid.uuid_table1 (id, column1) FORMAT Values

Query id: edc6f999-3e7d-40a0-8a29-3137e97e3607

Ok.

1 row in set. Elapsed: 0.529 sec.

View the records using the distributed table:

SELECT * FROM db_uuid.dist_uuid_table1;

SELECT *
FROM db_uuid.dist_uuid_table1

Query id: 6cbab449-9e7f-40fe-b8c2-62d46ba9f5c8

┌─id─┬─column1─┐
│  1 │ abc     │
└────┴─────────┘
┌─id─┬─column1─┐
│  2 │ def     │
└────┴─────────┘

2 rows in set. Elapsed: 0.007 sec.

Alternatives

The default replication path can be defined beforehand with macros, also using {uuid}.

Set the defaults for tables on each node:

<default_replica_path>/clickhouse/tables/{shard}/db_uuid/{uuid}</default_replica_path>
<default_replica_name>{replica}</default_replica_name>

Tip

You can also define a macro {database} on each node if the nodes are used for certain databases.

Create a table without explicit parameters:

CREATE TABLE db_uuid.uuid_table1 ON CLUSTER 'cluster_1S_2R'
   (
     id UInt64,
     column1 String
   )
   ENGINE = ReplicatedMergeTree
   ORDER BY (id);

CREATE TABLE db_uuid.uuid_table1 ON CLUSTER cluster_1S_2R
(
    `id` UInt64,
    `column1` String
)
ENGINE = ReplicatedMergeTree
ORDER BY id

Query id: ab68cda9-ae41-4d6d-8d3b-20d8255774ee

┌─host──────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ chnode2.marsnet.local │ 9440 │      0 │       │                   1 │                0 │
│ chnode1.marsnet.local │ 9440 │      0 │       │                   0 │                0 │
└───────────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

2 rows in set. Elapsed: 1.175 sec.

Verify that it used the settings from the default config:

SHOW CREATE TABLE db_uuid.uuid_table1;

SHOW CREATE TABLE db_uuid.uuid_table1

Query id: 5925ecce-a54f-47d8-9c3a-ad3257840c9e

┌─statement────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE db_uuid.uuid_table1
(
    `id` UInt64,
    `column1` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/db_uuid/{uuid}', '{replica}')
ORDER BY id
SETTINGS index_granularity = 8192 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.003 sec.

Troubleshooting

Example command to get the table information and UUID

SELECT * FROM system.tables
WHERE database = 'db_uuid' AND name = 'uuid_table1';

Example command to get information from ZooKeeper about the table with the UUID shown above

SELECT * FROM system.zookeeper
WHERE path = '/clickhouse/tables/1/db_uuid/9e8a3cc2-0dec-4438-81a7-c3e63ce2a1cf/replicas';
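The path queried above follows from the default_replica_path template once its placeholders are filled in. The sketch below is plain string substitution for illustration only, not how ClickHouse expands macros internally; the helper name expand_replica_path is hypothetical, and the shard and UUID values are simply the ones that appear in the query above.

```python
def expand_replica_path(template: str, macros: dict) -> str:
    """Replace each {name} placeholder in the template with its value."""
    path = template
    for name, value in macros.items():
        path = path.replace("{" + name + "}", str(value))
    return path


# Template taken from default_replica_path; the values are illustrative and
# copied from the troubleshooting query above.
template = "/clickhouse/tables/{shard}/db_uuid/{uuid}"
macros = {"shard": 1, "uuid": "9e8a3cc2-0dec-4438-81a7-c3e63ce2a1cf"}

# The replicas of a table are listed under the "replicas" child of its path.
replicas_path = expand_replica_path(template, macros) + "/replicas"
print(replicas_path)
```

If the printed path does not match what system.zookeeper contains, comparing the two strings usually shows which placeholder was expanded differently on that node.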

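Note

The database must be of the Atomic type. If you are upgrading from a previous version, the default database is most likely of the Ordinary type.

To check, for example: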

SELECT name, engine FROM system.databases WHERE name = 'db_uuid';

SELECT
    name,
    engine
FROM system.databases
WHERE name = 'db_uuid'

Query id: b047d459-a1d2-4016-bcf9-3e97e30e49c2

┌─name────┬─engine─┐
│ db_uuid │ Atomic │
└─────────┴────────┘

1 row in set. Elapsed: 0.004 sec.

ClickHouse Keeper dynamic reconfiguration

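Description

ClickHouse Keeper partially supports the ZooKeeper reconfig command for dynamic cluster reconfiguration if keeper_server.enable_reconfiguration is turned on.

Note

If this setting is turned off, you can reconfigure the cluster by manually editing the raft_configuration section on the replicas. Make sure you edit the files on all replicas, because only the leader applies the changes. Alternatively, you can send a reconfig query through any ZooKeeper-compatible client.

The virtual node /keeper/config contains the last committed cluster configuration in the following format: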

server.id = server_host:server_port[;server_type][;server_priority]
server.id2 = ...
...

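Each server entry is delimited by a newline.
server_type is either participant or learner (a learner does not participate in leader elections).
server_priority is a non-negative integer that tells which nodes should be preferred in leader elections. A priority of 0 means the server will never become the leader.

Example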

:) get /keeper/config
server.1=zoo1:9234;participant;1
server.2=zoo2:9234;participant;1
server.3=zoo3:9234;participant;1
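To make the entry format concrete, here is a minimal parsing sketch for the text returned by get /keeper/config. The KeeperServer class and both function names are hypothetical and are not part of any ClickHouse or kazoo API. The defaults for an omitted type and priority (participant and 1) are the ones this page documents for joining entries; treating them as defaults when reading the node is an assumption.

```python
from dataclasses import dataclass


@dataclass
class KeeperServer:
    server_id: int
    host: str
    port: int
    server_type: str = "participant"  # assumed default when omitted
    priority: int = 1                 # assumed default when omitted


def parse_server_entry(entry: str) -> KeeperServer:
    """Parse 'server.<id>=<host>:<port>[;server_type][;server_priority]'."""
    key, _, value = entry.partition("=")
    server_id = int(key.strip().removeprefix("server."))
    address, *options = value.strip().split(";")
    host, _, port = address.rpartition(":")
    server = KeeperServer(server_id, host, int(port))
    for option in options:
        if option in ("participant", "learner"):
            server.server_type = option
        else:
            server.priority = int(option)
    return server


def parse_keeper_config(text: str) -> list:
    """Parse newline-delimited server entries, skipping blank lines."""
    return [parse_server_entry(line) for line in text.splitlines() if line.strip()]


sample = """server.1=zoo1:9234;participant;1
server.2=zoo2:9234;participant;1
server.3=zoo3:9234;participant;1
"""
for server in parse_keeper_config(sample):
    print(server)
```

Parsing the node before and after a reconfig call is one way to check whether a change was actually applied.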

You can use the reconfig command to add new servers, remove existing ones, and change the priorities of existing servers. Here are some examples (using clickhouse-keeper-client):

# Add two new servers
reconfig add "server.5=localhost:123,server.6=localhost:234;learner"
# Remove two other servers
reconfig remove "3,4"
# Change existing server priority to 8
reconfig add "server.5=localhost:5123;participant;8"

And here are examples for kazoo:

# Add two new servers, remove two other servers
reconfig(joining="server.5=localhost:123,server.6=localhost:234;learner", leaving="3,4")

# Change existing server priority to 8
reconfig(joining="server.5=localhost:5123;participant;8", leaving=None)

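Servers in joining must use the server format described above, and server entries must be separated by commas. When adding new servers, you can omit server_priority (the default value is 1) and server_type (the default value is participant).

To change the priority of an existing server, add it to joining with the target priority. The server host, port, and type must match the existing server configuration.

Servers are added and removed in the order in which they appear in joining and leaving. All updates from joining are processed before the updates from leaving.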
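The rules above for the joining string can be sketched as a small builder. The function names format_server_entry and build_joining are hypothetical helpers, not part of clickhouse-keeper-client or kazoo. The choice to always write the type when a non-default priority is given mirrors the documented example and is a conservative assumption, not a stated requirement.

```python
def format_server_entry(server_id, host, port, server_type="participant", priority=1):
    """Format one joining entry, omitting the documented defaults."""
    if server_type not in ("participant", "learner"):
        raise ValueError(f"unknown server_type: {server_type}")
    if priority < 0:
        raise ValueError("server_priority must be a non-negative integer")
    entry = f"server.{server_id}={host}:{port}"
    if priority != 1:
        # Assumption: spell out the type whenever a priority is given,
        # as in the documented "server.5=localhost:5123;participant;8".
        entry += f";{server_type};{priority}"
    elif server_type != "participant":
        entry += f";{server_type}"
    return entry


def build_joining(servers):
    """Join server entries with commas, preserving their order."""
    return ",".join(format_server_entry(**server) for server in servers)


joining = build_joining([
    {"server_id": 5, "host": "localhost", "port": 123},
    {"server_id": 6, "host": "localhost", "port": 234, "server_type": "learner"},
])
print(joining)  # server.5=localhost:123,server.6=localhost:234;learner

# Changing a priority reuses the same host, port, and type with a new priority.
print(format_server_entry(5, "localhost", 5123, priority=8))
```

Because entries are processed in order of appearance, the list order passed to build_joining is the order in which the servers would be added.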

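There are some caveats in the Keeper reconfiguration implementation:

Only incremental reconfiguration is supported. Requests with a non-empty new_members are declined.

The ClickHouse Keeper implementation relies on the NuRaft API to change membership dynamically. NuRaft can add or remove only a single server at a time. This means that each change to the configuration (each part of joining, each part of leaving) must be decided on separately. Bulk reconfiguration is therefore not offered, because it would be misleading for end users.

Changing the server type (participant/learner) is not possible either, because NuRaft does not support it. The only way would be to remove and re-add the server, which again would be misleading.
You cannot use the returned znodestat value.
The from_version field is not used. All requests with from_version set are declined. This is because /keeper/config is a virtual node: it is not stored in persistent storage but is generated on the fly from the specified node configuration for every request. This decision was made to avoid duplicating data, since NuRaft already stores this configuration.
Unlike ZooKeeper, there is no way to wait for a cluster reconfiguration by submitting a sync command. The new configuration will eventually be applied, but with no time guarantees.
The reconfig command may fail for various reasons. You can check the cluster's state to see whether the update was applied.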
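Two of these caveats can be checked on the client side before a request is sent. The following is a rough sketch, not Keeper's actual validation logic: the function name check_reconfig_request is hypothetical, and treating -1 as "from_version not set" is an assumption borrowed from the default used by ZooKeeper-style clients.

```python
def check_reconfig_request(joining=None, leaving=None, new_members=None, from_version=-1):
    """Return the reasons a request would be declined under the caveats above."""
    reasons = []
    if new_members:
        reasons.append("new_members is non-empty: only incremental reconfiguration is supported")
    if from_version != -1:  # assumption: -1 means the field is not set
        reasons.append("from_version is set: the field is not used")
    if not joining and not leaving:
        # Not a documented rejection; flagged here because the request would change nothing.
        reasons.append("neither joining nor leaving is given")
    return reasons


print(check_reconfig_request(joining="server.5=localhost:123", leaving="3,4"))
print(check_reconfig_request(new_members="server.1=zoo1:9234", from_version=3))
```

An empty list only means the request passes these two checks; the reconfig command can still fail for other reasons, so the cluster state should be verified afterwards.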

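Converting a single-node keeper into a cluster

Sometimes it is necessary to extend an experimental keeper node into a cluster. Here is a step-by-step scheme for growing it into a 3-node cluster:

IMPORTANT: new nodes must be added in batches smaller than the current quorum, otherwise they will elect a leader among themselves. In this example, they are added one at a time.
The existing keeper node must have the keeper_server.enable_reconfiguration configuration parameter turned on.
Start the second node with the full new configuration of the keeper cluster.
After it has started, add it to node 1 using reconfig.
Now start the third node and add it using reconfig.
Update the clickhouse-server configuration by adding the new keeper nodes to it, and restart it to apply the changes.
Update the raft configuration of node 1 and, optionally, restart it.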
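The one-at-a-time additions in the steps above can be sketched as a list of clickhouse-keeper-client commands, one per new node. The host names keeper2 and keeper3 and the port 9234 are placeholders for illustration, and the helper name reconfig_add_commands is hypothetical; only the reconfig add syntax comes from the examples on this page.

```python
def reconfig_add_commands(new_servers):
    """Build one 'reconfig add' command per new server, so nodes join one by one."""
    return [
        f'reconfig add "server.{server_id}={host}:{port}"'
        for server_id, host, port in new_servers
    ]


# Placeholder hosts and port; replace them with the values from your raft configuration.
commands = reconfig_add_commands([
    (2, "keeper2", 9234),
    (3, "keeper3", 9234),
])
for command in commands:
    # Run each command against node 1 only after the corresponding new node has started.
    print(command)
```

Issuing the commands separately, and checking /keeper/config between them, keeps each batch at a single node as the steps require.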

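To get familiar with the process, there is a sandbox repository.

Unsupported features

While ClickHouse Keeper aims to be fully compatible with ZooKeeper, some features are currently not implemented (although development is ongoing):

create does not support returning a Stat object
create does not support TTL
addWatch does not work with PERSISTENT watches
removeWatch and removeAllWatches are not supported
setWatches is not supported
Creating CONTAINER type znodes is not supported
SASL authentication is not supported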
