ReplacingMergeTree

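The engine differs from MergeTree in that it removes duplicate entries that have the same sorting key value (the ORDER BY table section, not PRIMARY KEY).

Data deduplication occurs only during a merge. Merges run in the background at an unknown time, so you cannot plan for them, and some of the data may remain unprocessed. Although you can trigger an unscheduled merge with the OPTIMIZE query, do not count on it, because OPTIMIZE reads and writes a large amount of data.

ReplacingMergeTree is therefore suitable for clearing out duplicate data in the background to save space, but it does not guarantee the absence of duplicates.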
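As a rough illustration of the paragraphs above, the following sketch counts sorting-key values that still have more than one physical row and then forces a merge. The table `events` and its columns are hypothetical names chosen for this example. Whether the duplicate is still visible depends on whether a background merge has already run.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE events
(
    `key` Int64,
    `payload` String
)
ENGINE = ReplacingMergeTree
ORDER BY key;

-- Separate inserts normally create separate parts,
-- so both rows can coexist until those parts are merged.
INSERT INTO events VALUES (1, 'a');
INSERT INTO events VALUES (1, 'b');

-- Sorting-key values that currently have more than one row.
SELECT key, count() AS copies
FROM events
GROUP BY key
HAVING copies > 1;

-- Force an unscheduled merge. This rewrites data and is expensive on large tables.
OPTIMIZE TABLE events FINAL;
```

Running the duplicate check again after the OPTIMIZE should return no rows for this small table, but this is not something to build a production workflow on.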

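Note

For a detailed guide on ReplacingMergeTree, including best practices and how to optimize performance, see the ReplacingMergeTree guide.

Creating a table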

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = ReplacingMergeTree([ver [, is_deleted]])
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]

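For a description of the request parameters, see the statement description.

Note

Uniqueness of rows is determined by the ORDER BY table section, not by PRIMARY KEY.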
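The note above can be sketched with a table whose PRIMARY KEY is only a prefix of its sorting key. The table `user_events` and its columns are hypothetical. Rows are treated as duplicates only when the full ORDER BY tuple matches, not when only the PRIMARY KEY matches.

```sql
-- Hypothetical table: PRIMARY KEY is a prefix of the sorting key.
CREATE TABLE user_events
(
    `user_id` UInt64,
    `event_id` UInt64,
    `payload` String
)
ENGINE = ReplacingMergeTree
ORDER BY (user_id, event_id)
PRIMARY KEY user_id;

INSERT INTO user_events VALUES (1, 10, 'a');
-- Same user_id but different event_id: not a duplicate.
INSERT INTO user_events VALUES (1, 11, 'b');
-- Same (user_id, event_id) as the first row: replaces it.
INSERT INTO user_events VALUES (1, 10, 'c');

SELECT * FROM user_events FINAL ORDER BY event_id;
```

Under the rules described in this section, the query should return two rows: the replacement for event 10 and the untouched row for event 11.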

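ReplacingMergeTree parameters

ver

ver: column with the version number. Type UInt*, Date, DateTime or DateTime64. Optional parameter.

When merging, ReplacingMergeTree keeps only one row out of all rows with the same sorting key:

- If ver is not set, the last row in the selection. A selection is a set of rows in the set of parts participating in the merge. The most recently created part (the last insert) comes last in the selection, so after deduplication the very last row of the most recent insert remains for each unique sorting key.
- If ver is specified, the row with the maximum version. If several rows have the same ver, the "ver not specified" rule applies to them, i.e. the most recently inserted row remains.

Example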

-- without ver - the last inserted 'wins'
CREATE TABLE myFirstReplacingMT
(
    `key` Int64,
    `someCol` String,
    `eventTime` DateTime
)
ENGINE = ReplacingMergeTree
ORDER BY key;

INSERT INTO myFirstReplacingMT Values (1, 'first', '2020-01-01 01:01:01');
INSERT INTO myFirstReplacingMT Values (1, 'second', '2020-01-01 00:00:00');

SELECT * FROM myFirstReplacingMT FINAL;

┌─key─┬─someCol─┬───────────eventTime─┐
│   1 │ second  │ 2020-01-01 00:00:00 │
└─────┴─────────┴─────────────────────┘


-- with ver - the row with the biggest ver 'wins'
CREATE TABLE mySecondReplacingMT
(
    `key` Int64,
    `someCol` String,
    `eventTime` DateTime
)
ENGINE = ReplacingMergeTree(eventTime)
ORDER BY key;

INSERT INTO mySecondReplacingMT Values (1, 'first', '2020-01-01 01:01:01');
INSERT INTO mySecondReplacingMT Values (1, 'second', '2020-01-01 00:00:00');

SELECT * FROM mySecondReplacingMT FINAL;

┌─key─┬─someCol─┬───────────eventTime─┐
│   1 │ first   │ 2020-01-01 01:01:01 │
└─────┴─────────┴─────────────────────┘
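The tie-breaking rule for equal versions is not covered by the two examples above, so here is a minimal sketch of it. The table name `myTieReplacingMT` is hypothetical and mirrors the earlier tables. Both rows carry the same eventTime, so the "ver not specified" rule should apply.

```sql
-- same ver on both rows - the last inserted should 'win'
CREATE TABLE myTieReplacingMT
(
    `key` Int64,
    `someCol` String,
    `eventTime` DateTime
)
ENGINE = ReplacingMergeTree(eventTime)
ORDER BY key;

INSERT INTO myTieReplacingMT Values (1, 'first', '2020-01-01 01:01:01');
INSERT INTO myTieReplacingMT Values (1, 'second', '2020-01-01 01:01:01');

SELECT * FROM myTieReplacingMT FINAL;
```

According to the rule stated above, the row with someCol = 'second' is the one expected to remain.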

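is_deleted

is_deleted: name of a column used during a merge to determine whether the data in this row represents the state or is to be deleted; 1 is a "deleted" row, 0 is a "state" row.

Column data type: UInt8.

Note

is_deleted can only be enabled when ver is used.

Rows are deleted only on OPTIMIZE ... FINAL CLEANUP. The special CLEANUP keyword is not allowed by default; it requires the allow_experimental_replacing_merge_with_cleanup MergeTree setting to be enabled.

No matter what operation is performed on the data, the version must be increased. If two inserted rows have the same version number, the last inserted row is kept.

Example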

-- with ver and is_deleted
CREATE OR REPLACE TABLE myThirdReplacingMT
(
    `key` Int64,
    `someCol` String,
    `eventTime` DateTime,
    `is_deleted` UInt8
)
ENGINE = ReplacingMergeTree(eventTime, is_deleted)
ORDER BY key
SETTINGS allow_experimental_replacing_merge_with_cleanup = 1;

INSERT INTO myThirdReplacingMT Values (1, 'first', '2020-01-01 01:01:01', 0);
INSERT INTO myThirdReplacingMT Values (1, 'first', '2020-01-01 01:01:01', 1);

select * from myThirdReplacingMT final;

0 rows in set. Elapsed: 0.003 sec.

-- delete rows with is_deleted
OPTIMIZE TABLE myThirdReplacingMT FINAL CLEANUP;

INSERT INTO myThirdReplacingMT Values (1, 'first', '2020-01-01 00:00:00', 0);

select * from myThirdReplacingMT final;

┌─key─┬─someCol─┬───────────eventTime─┬─is_deleted─┐
│   1 │ first   │ 2020-01-01 00:00:00 │          0 │
└─────┴─────────┴─────────────────────┴────────────┘

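Query clauses

When creating a ReplacingMergeTree table, the same clauses are required as when creating a MergeTree table.

Deprecated method for creating a table

Note

Do not use this method in new projects and, if possible, switch old projects to the method described above.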

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE [=] ReplacingMergeTree(date-column [, sampling_expression], (primary, key), index_granularity, [ver])

All of the parameters except ver have the same meaning as in MergeTree.

ver: column with the version. Optional parameter. For a description, see the text above.
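For readers migrating off the deprecated form, the sketch below shows one concrete legacy declaration in a comment and what it might look like in the current syntax. The table `visits_versions` and its columns are hypothetical. The mapping assumes the MergeTree behavior in which the legacy date column implies monthly partitioning, so verify it against the MergeTree documentation before relying on it.

```sql
-- Deprecated form, shown for comparison only:
--   ENGINE = ReplacingMergeTree(EventDate, (CounterID, EventDate), 8192, ver)

-- Assumed equivalent in the current syntax (hypothetical table and columns).
CREATE TABLE visits_versions
(
    `EventDate` Date,
    `CounterID` UInt32,
    `UserID` UInt64,
    `ver` UInt64
)
ENGINE = ReplacingMergeTree(ver)
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate)
SETTINGS index_granularity = 8192;
```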

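Query-time deduplication & FINAL

At merge time, ReplacingMergeTree identifies duplicate rows, using the values of the ORDER BY columns (used to create the table) as a unique identifier, and retains only the highest version. This offers eventual correctness only: it does not guarantee that rows will be deduplicated, and you should not rely on it. Queries can therefore produce incorrect answers, because rows that have since been updated or deleted are still counted.

To obtain correct answers, users need to complement background merges with query-time deduplication and removal of deleted rows. This can be achieved using the FINAL operator. Consider the following example: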

CREATE TABLE rmt_example
(
    `number` UInt16
)
ENGINE = ReplacingMergeTree
ORDER BY number

INSERT INTO rmt_example SELECT floor(randUniform(0, 100)) AS number
FROM numbers(1000000000)

0 rows in set. Elapsed: 19.958 sec. Processed 1.00 billion rows, 8.00 GB (50.11 million rows/s., 400.84 MB/s.)

Querying without FINAL produces an incorrect count (the exact result will vary depending on merges):

SELECT count()
FROM rmt_example

┌─count()─┐
│     200 │
└─────────┘

1 row in set. Elapsed: 0.002 sec.

Adding FINAL produces the correct result:

SELECT count()
FROM rmt_example
FINAL

┌─count()─┐
│     100 │
└─────────┘

1 row in set. Elapsed: 0.002 sec.
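Deduplication at query time can also be approximated with ordinary aggregation instead of FINAL. The sketch below is an alternative technique, not the mechanism FINAL uses. It reuses rmt_example and mySecondReplacingMT from earlier in this page. The aliases `distinct_numbers`, `latestCol` and `latestTime` are names introduced here and are kept distinct from the column names to avoid alias clashes inside the aggregates.

```sql
-- Count distinct sorting-key values without FINAL.
SELECT uniqExact(number) AS distinct_numbers
FROM rmt_example;

-- The same idea written as an explicit GROUP BY.
SELECT count()
FROM
(
    SELECT number
    FROM rmt_example
    GROUP BY number
);

-- For a table with a version column, pick the value belonging to the highest version.
SELECT
    key,
    argMax(someCol, eventTime) AS latestCol,
    max(eventTime) AS latestTime
FROM mySecondReplacingMT
GROUP BY key;
```

Note that this aggregation approach does not reproduce the tie-breaking rule for equal versions and does not account for is_deleted, so it only suits tables where those cases do not arise.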

For further details on FINAL, including how to optimize FINAL performance, we recommend reading our detailed guide on ReplacingMergeTree.
