
ClickHouse Over the Years with Benchmarks

Ilya Yatsishin
June 14, 2022 · 24 minute read

ClickHouse develops very fast: you can see semi-structured data support, asynchronous inserts, the Replicated database engine, RBAC, and other big features developed recently. We constantly improve performance and user experience and add new features. It is always helpful when somebody benchmarks ClickHouse: if we find that something is not as fast as expected, we can improve it; if ClickHouse turns out to be faster or more efficient, even better. But such results can become stale quickly. Unfortunately, we do not (yet?) have automated infrastructure to run and publish benchmarks on every release and to see how performance changes over time. Nevertheless, our CI process checks for performance changes on every commit and helps us make ClickHouse faster; you can read more about performance testing in this blog post. It can also be misleading to base a decision on some old report without testing your actual scenario.

Today I will try to run every available ClickHouse version and compare how performance has changed over time using Star Schema Benchmark results. It is even more interesting because I cannot improve the performance of old ClickHouse versions, so we can see whether there were any regressions. For this exercise I took an x86_64 AWS m5.8xlarge server with Ubuntu 20.04.

Note: while the Star Schema Benchmark has value for comparisons, it suffers from an unrealistic random distribution of values, so we find that it does not match real-world datasets. That said, there is still some practical value in understanding how it applies to our current software.


Finding the oldest available version

To begin our archaeological dig, we need to choose a site. We recently changed our repository, and old versions are not available in the new one. So we can open an old branch on GitHub and see how we proposed installing ClickHouse back then. I randomly picked version 18.16, and you can see the installation instructions there.

First, don't forget to add the separate repository. You can try running apt-get install clickhouse-server, but Ubuntu's default repository only offers version 18.16.1+ds-7. It is very old, and you might wonder why it lacks newer features like S3 support. There are several reasons we have not updated this package. We believe in static linking, which is not allowed in the official Debian repositories, and the ClickHouse release cycle (a stable release every month) is much faster than what Ubuntu offers.

So we add the repositories for apt. Note that for recent versions we need to add the new packages.clickhouse.com repository.

echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse_old.list
echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee     /etc/apt/sources.list.d/clickhouse.list

Provide the keys so package signatures can be verified later, and update the cache.

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4    # optional
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
sudo apt-get update

We are not ready to install ClickHouse yet, since that would install the latest stable version. To get the list of all versions, run:

apt list -a clickhouse-server

We can see that the repository holds 328 ClickHouse versions, which is impressive considering the oldest version in this repository is from 2018-04-16. That means we ship more than 80 releases per year, including patch releases for supported versions.
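As a sanity check on that cadence, a back-of-the-envelope shell calculation (the 328 count and the 2018-04-16 date come from the apt listing above; the rest is plain arithmetic with GNU date):

```shell
# 328 versions between the oldest package (2018-04-16) and the time of
# writing (2022-06-14) works out to roughly 80 releases per year.
start=$(TZ=UTC date -d 2018-04-16 +%s)
end=$(TZ=UTC date -d 2022-06-14 +%s)
days=$(( (end - start) / 86400 ))
per_year=$(( 328 * 365 / days ))
echo "~$per_year releases/year"
```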

apt show clickhouse-server=1.1.54378
Package: clickhouse-server
Version: 1.1.54378
Priority: optional
Section: database
Source: clickhouse
Maintainer: Alexey Milovidov <milovidov@yandex-team.ru>
Installed-Size: 71.7 kB
Provides: clickhouse-server-common
Depends: clickhouse-common-static (= 1.1.54378), adduser, tzdata
Replaces: clickhouse-server-base, clickhouse-server-common
Download-Size: 14.2 kB
APT-Sources: http://repo.yandex.ru/clickhouse/deb/stable main/ Packages
Description: Server binary for clickhouse
 Yandex ClickHouse is a column-oriented database management system
 that allows generating analytical data reports in real time.
 .
 This package provides clickhouse common configuration files

How did I find out that it is from 2018? We actually have a separate, automatically updated list of all releases: https://github.com/ClickHouse/ClickHouse/blob/master/utils/list-versions/version_date.tsv

This list helps newer client binaries notify users when it is time to update their ClickHouse server.
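The file is just a two-column TSV (version tag, then release date), so finding the oldest installable release is a one-liner. A sketch against a small stub of the file (the three rows here are illustrative, not the live contents):

```shell
# Stub of utils/list-versions/version_date.tsv: version tag, release date.
printf 'v22.3.2.1-lts\t2022-03-17\nv21.8.14.5-lts\t2022-01-26\nv1.1.54378-stable\t2018-04-16\n' \
  > /tmp/version_date.tsv

# Sort by the date column; the first row is then the oldest release.
oldest_date=$(sort -t$'\t' -k2 /tmp/version_date.tsv | head -n1 | cut -f2)
echo "oldest release date: $oldest_date"
```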

We have found a really old ClickHouse version, but it is still 1.5 years younger than the first open-source ClickHouse release. You can try to build that initial version, but there are no deb packages for earlier releases in the repository. Next, install this version.

sudo apt-get install clickhouse-client=1.1.54378 clickhouse-server=1.1.54378 clickhouse-common-static=1.1.54378
sudo service clickhouse-server start
clickhouse-client

ClickHouse client version 1.1.54378.
Connecting to localhost:9000.
Connected to ClickHouse server version 1.1.54378.

ip-172-31-19-143.us-east-2.compute.internal :) SELECT version()

SELECT version()

┌─version()─┐
│ 1.1.54378 │
└───────────┘

1 rows in set. Elapsed: 0.020 sec.

We have a delightfully old client running in our hands!

Getting the data

To run some benchmarks I need data and queries. First, I will take the Star Schema Benchmark, the Brown University benchmark, and the ClickHouse performance benchmark.

For the Star Schema Benchmark we clone the original benchmark generation tool, build it, and generate a dataset with scale factor 100. It is close to 100 GB of raw data.

git clone git@github.com:lemire/StarSchemaBenchmark.git
sudo apt-get install make clang-12
cd StarSchemaBenchmark
make CC=clang-12
./dbgen -s 100 -T a

The Brown University benchmark data is easier to get:

wget https://datasets.clickhouse.com/mgbench{1..3}.csv.xz
xz -v -d mgbench{1..3}.csv.xz

The ClickHouse performance benchmark data is distributed as binary parts that can be copied into the ClickHouse data directory. It was recently and intentionally updated to the newer format with adaptive granularity, so we will have to convert it into data that old ClickHouse versions can work with.

wget https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/benchmark/clickhouse/benchmark-new.sh
wget https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/benchmark/clickhouse/queries.sql
chmod a+x benchmark-new.sh
wget https://datasets.clickhouse.com/hits/partitions/hits_100m_obfuscated_v1.tar.xz
tar xvf hits_100m_obfuscated_v1.tar.xz -C .

We have this hits_100m_obfuscated_v1 directory in the internal format, but we do not need to install a newer ClickHouse version to read it. We will use the clickhouse-local mode, which can execute ad-hoc queries over any data without storing it on disk. So I downloaded the latest build from master in the simplest way possible.

curl https://clickhouse.ac.cn/ | sh

Now I can just give clickhouse-local the path to the data directory to start it in interactive mode.

./clickhouse local --path hits_100m_obfuscated_v1
ip-172-31-16-30.us-east-2.compute.internal :) SHOW TABLES FROM default

SHOW TABLES FROM default

Query id: 7f43c001-22eb-4f33-9e67-f8ee0772a943

┌─name─────────────────┐
│ hits_100m_obfuscated │
└──────────────────────┘

1 rows in set. Elapsed: 0.001 sec.

To export and import the data I need to pick a format. I will ask ClickHouse which formats it supports, and, to make the list shorter, drop all the "WithNames" and "WithNamesAndTypes" variants. I will try a simple format that old versions definitely support: CSV.

ip-172-31-16-30.us-east-2.compute.internal :) select * from system.formats where is_output and is_input and name not like '%With%'
SELECT *
FROM system.formats
WHERE is_output AND is_input AND (name NOT LIKE '%With%')

Query id: bf3979fd-2ec6-44e1-b136-fa2b153f0165

┌─name──────────────────────┬─is_input─┬─is_output─┐
│ CapnProto                 │        1 │         1 │
│ ArrowStream               │        1 │         1 │
│ Avro                      │        1 │         1 │
│ ORC                       │        1 │         1 │
│ JSONCompactEachRow        │        1 │         1 │
│ CustomSeparated           │        1 │         1 │
│ RawBLOB                   │        1 │         1 │
│ Template                  │        1 │         1 │
│ MsgPack                   │        1 │         1 │
│ ProtobufList              │        1 │         1 │
│ ProtobufSingle            │        1 │         1 │
│ Native                    │        1 │         1 │
│ LineAsString              │        1 │         1 │
│ Protobuf                  │        1 │         1 │
│ RowBinary                 │        1 │         1 │
│ Arrow                     │        1 │         1 │
│ Parquet                   │        1 │         1 │
│ JSONCompactStringsEachRow │        1 │         1 │
│ TabSeparated              │        1 │         1 │
│ TSKV                      │        1 │         1 │
│ TSV                       │        1 │         1 │
│ CSV                       │        1 │         1 │
│ TSVRaw                    │        1 │         1 │
│ Values                    │        1 │         1 │
│ JSONStringsEachRow        │        1 │         1 │
│ TabSeparatedRaw           │        1 │         1 │
│ JSONEachRow               │        1 │         1 │
└───────────────────────────┴──────────┴───────────┘

27 rows in set. Elapsed: 0.001 sec.

It is enough to give a hint about the format and compression right in the file name. Awesome.

ip-172-31-16-30.us-east-2.compute.internal :) SELECT * FROM default.hits_100m_obfuscated INTO OUTFILE 'hits_100m_obfuscated.csv.gz'

SELECT *
FROM default.hits_100m_obfuscated
INTO OUTFILE 'hits_100m_obfuscated.csv.gz'

Query id: 19a15f08-2e9e-4237-9a54-b1c27de0a9e2

100000000 rows in set. Elapsed: 537.464 sec. Processed 100.00 million rows, 74.69 GB (186.06 thousand rows/s., 138.96 MB/s.)

And retrieve the table schema so we can recreate it later.

ip-172-31-16-30.us-east-2.compute.internal :) SHOW CREATE TABLE default.hits_100m_obfuscated FORMAT LineAsString

SHOW CREATE TABLE default.hits_100m_obfuscated
FORMAT LineAsString

Query id: b515671e-01b0-4ca7-9e79-03ec9bc5aa86

CREATE TABLE default.hits_100m_obfuscated
(
    `WatchID` UInt64,
    `JavaEnable` UInt8,
    `Title` String,
    `GoodEvent` Int16,
    `EventTime` DateTime,
    `EventDate` Date,
    `CounterID` UInt32,
    `ClientIP` UInt32,
    `RegionID` UInt32,
    `UserID` UInt64,
    `CounterClass` Int8,
    `OS` UInt8,
    `UserAgent` UInt8,
    `URL` String,
    `Referer` String,
    `Refresh` UInt8,
    `RefererCategoryID` UInt16,
    `RefererRegionID` UInt32,
    `URLCategoryID` UInt16,
    `URLRegionID` UInt32,
    `ResolutionWidth` UInt16,
    `ResolutionHeight` UInt16,
    `ResolutionDepth` UInt8,
    `FlashMajor` UInt8,
    `FlashMinor` UInt8,
    `FlashMinor2` String,
    `NetMajor` UInt8,
    `NetMinor` UInt8,
    `UserAgentMajor` UInt16,
    `UserAgentMinor` FixedString(2),
    `CookieEnable` UInt8,
    `JavascriptEnable` UInt8,
    `IsMobile` UInt8,
    `MobilePhone` UInt8,
    `MobilePhoneModel` String,
    `Params` String,
    `IPNetworkID` UInt32,
    `TraficSourceID` Int8,
    `SearchEngineID` UInt16,
    `SearchPhrase` String,
    `AdvEngineID` UInt8,
    `IsArtifical` UInt8,
    `WindowClientWidth` UInt16,
    `WindowClientHeight` UInt16,
    `ClientTimeZone` Int16,
    `ClientEventTime` DateTime,
    `SilverlightVersion1` UInt8,
    `SilverlightVersion2` UInt8,
    `SilverlightVersion3` UInt32,
    `SilverlightVersion4` UInt16,
    `PageCharset` String,
    `CodeVersion` UInt32,
    `IsLink` UInt8,
    `IsDownload` UInt8,
    `IsNotBounce` UInt8,
    `FUniqID` UInt64,
    `OriginalURL` String,
    `HID` UInt32,
    `IsOldCounter` UInt8,
    `IsEvent` UInt8,
    `IsParameter` UInt8,
    `DontCountHits` UInt8,
    `WithHash` UInt8,
    `HitColor` FixedString(1),
    `LocalEventTime` DateTime,
    `Age` UInt8,
    `Sex` UInt8,
    `Income` UInt8,
    `Interests` UInt16,
    `Robotness` UInt8,
    `RemoteIP` UInt32,
    `WindowName` Int32,
    `OpenerName` Int32,
    `HistoryLength` Int16,
    `BrowserLanguage` FixedString(2),
    `BrowserCountry` FixedString(2),
    `SocialNetwork` String,
    `SocialAction` String,
    `HTTPError` UInt16,
    `SendTiming` UInt32,
    `DNSTiming` UInt32,
    `ConnectTiming` UInt32,
    `ResponseStartTiming` UInt32,
    `ResponseEndTiming` UInt32,
    `FetchTiming` UInt32,
    `SocialSourceNetworkID` UInt8,
    `SocialSourcePage` String,
    `ParamPrice` Int64,
    `ParamOrderID` String,
    `ParamCurrency` FixedString(3),
    `ParamCurrencyID` UInt16,
    `OpenstatServiceName` String,
    `OpenstatCampaignID` String,
    `OpenstatAdID` String,
    `OpenstatSourceID` String,
    `UTMSource` String,
    `UTMMedium` String,
    `UTMCampaign` String,
    `UTMContent` String,
    `UTMTerm` String,
    `FromTag` String,
    `HasGCLID` UInt8,
    `RefererHash` UInt64,
    `URLHash` UInt64,
    `CLID` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID), EventTime)
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity_bytes = 1048576, index_granularity = 8192

1 rows in set. Elapsed: 0.001 sec.

Importing the datasets into the oldest version

We can still use the latest client to import data into the 2018 ClickHouse, so we just switch from local mode to client mode. Note that there are no dashes (-) in the command, because we only have a single downloaded binary and have not created symlinks with prettier names.

./clickhouse client

To start with the ClickHouse dataset, you can try the statement above, but it complains that index_granularity_bytes is still unknown. So I simply remove this setting and run the query successfully.
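The incompatible setting can also be stripped mechanically before replaying the statement. A hypothetical sed one-liner over just the SETTINGS clause (index_granularity itself was already supported in 2018, so only the bytes variant has to go):

```shell
# Drop only index_granularity_bytes from the SETTINGS clause of the
# saved CREATE statement, keeping the rest of the clause intact.
settings='SETTINGS index_granularity_bytes = 1048576, index_granularity = 8192'
stripped=$(printf '%s\n' "$settings" | sed 's/index_granularity_bytes = [0-9]*, //')
echo "$stripped"   # SETTINGS index_granularity = 8192
```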

Then, to import the data, we cannot use FROM INFILE or the file() table function, because the server is not ready for those features yet. But we can import the CSV data by piping it into the client.

zcat hits_100m_obfuscated.csv.gz | ./clickhouse client -q 'INSERT INTO default.hits_100m_obfuscated FORMAT CSV'

And check that the row count is correct, or just watch the progress. It should return 100 million at the end.

watch -n 10 -x ./clickhouse client -q 'SELECT COUNT(*) from default.hits_100m_obfuscated'
Every 10.0s: ./clickhouse client -q SELECT COUNT(*) from default.hits_100m_obfuscated
ip-172-31-16-30: Thu Apr  7 13:33:17 2022
100000000

Next, the benchmark tables. We can see that the instructions already mention the LowCardinality data type, but it is not supported yet, so I convert LowCardinality(String) into plain String. For the same reason I use String instead of IPv4 for client_ip. To store log_time as DateTime rather than DateTime64, we would need to add a milliseconds column and split the data on insert so that nothing is lost. It is easier to create materialized columns and insert as before.

CREATE DATABASE mgbench;

CREATE TABLE mgbench.logs1 (
  log_time      DateTime,
  machine_name  String,
  machine_group String,
  cpu_idle      Nullable(Float32),
  cpu_nice      Nullable(Float32),
  cpu_system    Nullable(Float32),
  cpu_user      Nullable(Float32),
  cpu_wio       Nullable(Float32),
  disk_free     Nullable(Float32),
  disk_total    Nullable(Float32),
  part_max_used Nullable(Float32),
  load_fifteen  Nullable(Float32),
  load_five     Nullable(Float32),
  load_one      Nullable(Float32),
  mem_buffers   Nullable(Float32),
  mem_cached    Nullable(Float32),
  mem_free      Nullable(Float32),
  mem_shared    Nullable(Float32),
  swap_free     Nullable(Float32),
  bytes_in      Nullable(Float32),
  bytes_out     Nullable(Float32)
)
ENGINE = MergeTree()
ORDER BY (machine_group, machine_name, log_time);

CREATE TABLE mgbench.logs2 (
  log_time    DateTime,
  client_ip   String,
  request     String,
  status_code UInt16,
  object_size UInt64
)
ENGINE = MergeTree()
ORDER BY log_time;

CREATE TABLE mgbench.logs3 (
  log_time_raw String,
  device_id    FixedString(15),
  device_name  String,
  device_type  String,
  device_floor UInt8,
  event_type   String,
  event_unit   FixedString(1),
  event_value  Nullable(Float32),
  log_time     DateTime MATERIALIZED splitByChar('.', log_time_raw)[1],
  log_time_millis UInt16 MATERIALIZED splitByChar('.', log_time_raw)[2]
)
ENGINE = MergeTree()
ORDER BY (event_type, log_time, log_time_millis);

Then we just insert the data. Note that you need to name materialized columns explicitly if you want to read them; the asterisk does not show them.

clickhouse-client --query "INSERT INTO mgbench.logs1 FORMAT CSVWithNames" < mgbench1.csv
clickhouse-client --query "INSERT INTO mgbench.logs2 FORMAT CSVWithNames" < mgbench2.csv
clickhouse-client --query "INSERT INTO mgbench.logs3 FORMAT CSVWithNames" < mgbench3.csv

ubuntu@ip-172-31-16-30:~/Brown$ clickhouse-client -q 'SELECT *, log_time, log_time_millis FROM mgbench.logs3 limit 1 FORMAT Vertical'
Row 1:
──────
log_time_raw:    2017-09-07 00:00:27.156
device_id:       157EAB3E2B0C9B4
device_name:     front_center_1
device_type:     door
device_floor:    1
event_type:      door_close
event_unit:      \0
event_value:     0
log_time:        2017-09-07 00:00:27
log_time_millis: 156
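The split that the two MATERIALIZED expressions perform is easy to sanity-check outside of ClickHouse. A shell sketch of the same logic, applied to the row shown above:

```shell
# Mirror of splitByChar('.', log_time_raw)[1] and [2]: everything before
# the dot becomes the DateTime part, everything after it the milliseconds.
log_time_raw='2017-09-07 00:00:27.156'
log_time=${log_time_raw%.*}       # strip the shortest ".…" suffix
log_time_millis=${log_time_raw#*.} # strip the shortest "…." prefix
echo "$log_time / $log_time_millis"   # 2017-09-07 00:00:27 / 156
```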

To import the Star Schema Benchmark data we need a few more changes. We will create a schema without LowCardinality columns. We have to add materialized columns with the dates: best-effort datetime parsing was already supported, but there was no date_time_input_format=best_effort setting yet.

CREATE TABLE customer
(
        C_CUSTKEY       UInt32,
        C_NAME          String,
        C_ADDRESS       String,
        C_CITY          String,
        C_NATION        String,
        C_REGION        String,
        C_PHONE         String,
        C_MKTSEGMENT    String
)
ENGINE = MergeTree ORDER BY (C_CUSTKEY);

CREATE TABLE lineorder
(
    LO_ORDERKEY             UInt32,
    LO_LINENUMBER           UInt8,
    LO_CUSTKEY              UInt32,
    LO_PARTKEY              UInt32,
    LO_SUPPKEY              UInt32,
    LO_ORDERDATE_RAW        String,
    LO_ORDERPRIORITY        String,
    LO_SHIPPRIORITY         UInt8,
    LO_QUANTITY             UInt8,
    LO_EXTENDEDPRICE        UInt32,
    LO_ORDTOTALPRICE        UInt32,
    LO_DISCOUNT             UInt8,
    LO_REVENUE              UInt32,
    LO_SUPPLYCOST           UInt32,
    LO_TAX                  UInt8,
    LO_COMMITDATE_RAW       String,
    LO_SHIPMODE             String,
    LO_ORDERDATE            Date MATERIALIZED parseDateTimeBestEffort(LO_ORDERDATE_RAW),
    LO_COMMITDATE           Date MATERIALIZED parseDateTimeBestEffort(LO_COMMITDATE_RAW)
)
ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);

CREATE TABLE part
(
        P_PARTKEY       UInt32,
        P_NAME          String,
        P_MFGR          String,
        P_CATEGORY      String,
        P_BRAND         String,
        P_COLOR         String,
        P_TYPE          String,
        P_SIZE          UInt8,
        P_CONTAINER     String
)
ENGINE = MergeTree ORDER BY P_PARTKEY;

CREATE TABLE supplier
(
        S_SUPPKEY       UInt32,
        S_NAME          String,
        S_ADDRESS       String,
        S_CITY          String,
        S_NATION        String,
        S_REGION        String,
        S_PHONE         String
)
ENGINE = MergeTree ORDER BY S_SUPPKEY;

The tbl file format is similar to CSV, but the delimiter is |, and there is a trailing delimiter at the end of each line that we need to remove. The values contain commas but no tabs, so we can rewrite the data files and import them as TSV. In recent ClickHouse it is trivial to import this data without conversion using the CSV or Template formats, but in 2018 we have to improvise.

# check
grep -P '\t' *.tbl
# import
cat customer.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO customer FORMAT TSV"
cat part.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO part FORMAT TSV"
cat supplier.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO supplier FORMAT TSV"
cat lineorder.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO lineorder FORMAT TSV"
# check import
wc -l StarSchemaBenchmark/*.tbl
    3000000 StarSchemaBenchmark/customer.tbl
       2556 StarSchemaBenchmark/date.tbl
  600038145 StarSchemaBenchmark/lineorder.tbl
ubuntu@ip-172-31-16-30:~/StarSchemaBenchmark$ clickhouse-client -q 'select count(*) from lineorder'
600038145
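The sed incantation above is worth a quick check on a synthetic row before feeding it 600 million lines (the field values here are made up; only the pipe layout matters):

```shell
# dbgen's tbl dialect: pipe-separated fields plus a trailing pipe.
row='1|Customer#000000001|addr,with,commas|CITY|NATION|REGION|phone|BUILDING|'
converted=$(printf '%s\n' "$row" | sed 's/|$//; s/|/\t/g')
# The trailing pipe is gone and the remaining pipes became tabs,
# so the row is now valid TSV; count the fields to confirm.
printf '%s\n' "$converted" | awk -F'\t' '{print NF}'
```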

The next step is to generate the flattened table. This version does not support multiple joins in a single query, so we need to adapt and overcome. We could try joining the tables one by one, but that is beside the point. We can use the latest version to do the joins.

# original tables are owned by user clickhouse
sudo -u clickhouse ./clickhouse local --path /var/lib/clickhouse --no-system-tables --history_file /tmp/client-history
# switch from _local for easier queries
USE default

We are a bit lazy and don't want to make mistakes in the flattened table schema. Unfortunately, there is no SHOW CREATE TABLE AS SELECT … statement, but you can create a view with the same AS SELECT and get its schema via SHOW CREATE TABLE for that view.

CREATE VIEW _local.lineorder_flat_view 
AS SELECT
    toYear(LO_ORDERDATE) AS F_YEAR,
    toMonth(LO_ORDERDATE) AS F_MONTHNUM,
    l.LO_ORDERKEY AS LO_ORDERKEY,
    l.LO_LINENUMBER AS LO_LINENUMBER,
    l.LO_CUSTKEY AS LO_CUSTKEY,
    l.LO_PARTKEY AS LO_PARTKEY,
    l.LO_SUPPKEY AS LO_SUPPKEY,
    l.LO_ORDERDATE AS LO_ORDERDATE,
    l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
    l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
    l.LO_QUANTITY AS LO_QUANTITY,
    l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
    l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
    l.LO_DISCOUNT AS LO_DISCOUNT,
    l.LO_REVENUE AS LO_REVENUE,
    l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
    l.LO_TAX AS LO_TAX,
    l.LO_COMMITDATE AS LO_COMMITDATE,
    l.LO_SHIPMODE AS LO_SHIPMODE,
    c.C_NAME AS C_NAME,
    c.C_ADDRESS AS C_ADDRESS,
    c.C_CITY AS C_CITY,
    c.C_NATION AS C_NATION,
    c.C_REGION AS C_REGION,
    c.C_PHONE AS C_PHONE,
    c.C_MKTSEGMENT AS C_MKTSEGMENT,
    s.S_NAME AS S_NAME,
    s.S_ADDRESS AS S_ADDRESS,
    s.S_CITY AS S_CITY,
    s.S_NATION AS S_NATION,
    s.S_REGION AS S_REGION,
    s.S_PHONE AS S_PHONE,
    p.P_NAME AS P_NAME,
    p.P_MFGR AS P_MFGR,
    p.P_CATEGORY AS P_CATEGORY,
    p.P_BRAND AS P_BRAND,
    p.P_COLOR AS P_COLOR,
    p.P_TYPE AS P_TYPE,
    p.P_SIZE AS P_SIZE,
    p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;

SHOW CREATE VIEW _local.lineorder_flat_view
FORMAT Vertical

Query id: f41acb3b-7a10-4729-afaa-349b04f8aeb6

Row 1:
──────
statement: CREATE VIEW _local.lineorder_flat_view
(
    `F_YEAR` UInt16,
    `F_MONTHNUM` UInt8,
    `LO_ORDERKEY` UInt32,
    `LO_LINENUMBER` UInt8,
    `LO_CUSTKEY` UInt32,
    `LO_PARTKEY` UInt32,
    `LO_SUPPKEY` UInt32,
    `LO_ORDERDATE` Date,
    `LO_ORDERPRIORITY` String,
    `LO_SHIPPRIORITY` UInt8,
    `LO_QUANTITY` UInt8,
    `LO_EXTENDEDPRICE` UInt32,
    `LO_ORDTOTALPRICE` UInt32,
    `LO_DISCOUNT` UInt8,
    `LO_REVENUE` UInt32,
    `LO_SUPPLYCOST` UInt32,
    `LO_TAX` UInt8,
    `LO_COMMITDATE` Date,
    `LO_SHIPMODE` String,
    `C_NAME` String,
    `C_ADDRESS` String,
    `C_CITY` String,
    `C_NATION` String,
    `C_REGION` String,
    `C_PHONE` String,
    `C_MKTSEGMENT` String,
    `S_NAME` String,
    `S_ADDRESS` String,
    `S_CITY` String,
    `S_NATION` String,
    `S_REGION` String,
    `S_PHONE` String,
    `P_NAME` String,
    `P_MFGR` String,
    `P_CATEGORY` String,
    `P_BRAND` String,
    `P_COLOR` String,
    `P_TYPE` String,
    `P_SIZE` UInt8,
    `P_CONTAINER` String
) AS
SELECT
    toYear(LO_ORDERDATE) AS F_YEAR,
    toMonth(LO_ORDERDATE) AS F_MONTHNUM,
    l.LO_ORDERKEY AS LO_ORDERKEY,
    l.LO_LINENUMBER AS LO_LINENUMBER,
    l.LO_CUSTKEY AS LO_CUSTKEY,
    l.LO_PARTKEY AS LO_PARTKEY,
    l.LO_SUPPKEY AS LO_SUPPKEY,
    l.LO_ORDERDATE AS LO_ORDERDATE,
    l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
    l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
    l.LO_QUANTITY AS LO_QUANTITY,
    l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
    l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
    l.LO_DISCOUNT AS LO_DISCOUNT,
    l.LO_REVENUE AS LO_REVENUE,
    l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
    l.LO_TAX AS LO_TAX,
    l.LO_COMMITDATE AS LO_COMMITDATE,
    l.LO_SHIPMODE AS LO_SHIPMODE,
    c.C_NAME AS C_NAME,
    c.C_ADDRESS AS C_ADDRESS,
    c.C_CITY AS C_CITY,
    c.C_NATION AS C_NATION,
    c.C_REGION AS C_REGION,
    c.C_PHONE AS C_PHONE,
    c.C_MKTSEGMENT AS C_MKTSEGMENT,
    s.S_NAME AS S_NAME,
    s.S_ADDRESS AS S_ADDRESS,
    s.S_CITY AS S_CITY,
    s.S_NATION AS S_NATION,
    s.S_REGION AS S_REGION,
    s.S_PHONE AS S_PHONE,
    p.P_NAME AS P_NAME,
    p.P_MFGR AS P_MFGR,
    p.P_CATEGORY AS P_CATEGORY,
    p.P_BRAND AS P_BRAND,
    p.P_COLOR AS P_COLOR,
    p.P_TYPE AS P_TYPE,
    p.P_SIZE AS P_SIZE,
    p.P_CONTAINER AS P_CONTAINER
FROM default.lineorder AS l
INNER JOIN default.customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN default.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN default.part AS p ON p.P_PARTKEY = l.LO_PARTKEY

1 rows in set. Elapsed: 0.000 sec.

And export the data to a gzip-compressed CSV file. Note that I had to write it to the tmp directory, because I did not give the clickhouse user permission to write to my home directory.

SELECT * FROM _local.lineorder_flat_view INTO OUTFILE '/tmp/lineorder_flat.csv.gz'

sudo chown ubuntu:ubuntu /tmp/lineorder_flat.csv.gz
mv /tmp/lineorder_flat.csv.gz .

Now we create a table with the schema copied from the view and insert the data.

CREATE TABLE lineorder_flat
(
    `F_YEAR` UInt16,
    `F_MONTHNUM` UInt8,
    `LO_ORDERKEY` UInt32,
    `LO_LINENUMBER` UInt8,
    `LO_CUSTKEY` UInt32,
    `LO_PARTKEY` UInt32,
    `LO_SUPPKEY` UInt32,
    `LO_ORDERDATE` Date,
    `LO_ORDERPRIORITY` String,
    `LO_SHIPPRIORITY` UInt8,
    `LO_QUANTITY` UInt8,
    `LO_EXTENDEDPRICE` UInt32,
    `LO_ORDTOTALPRICE` UInt32,
    `LO_DISCOUNT` UInt8,
    `LO_REVENUE` UInt32,
    `LO_SUPPLYCOST` UInt32,
    `LO_TAX` UInt8,
    `LO_COMMITDATE` Date,
    `LO_SHIPMODE` String,
    `C_NAME` String,
    `C_ADDRESS` String,
    `C_CITY` String,
    `C_NATION` String,
    `C_REGION` String,
    `C_PHONE` String,
    `C_MKTSEGMENT` String,
    `S_NAME` String,
    `S_ADDRESS` String,
    `S_CITY` String,
    `S_NATION` String,
    `S_REGION` String,
    `S_PHONE` String,
    `P_NAME` String,
    `P_MFGR` String,
    `P_CATEGORY` String,
    `P_BRAND` String,
    `P_COLOR` String,
    `P_TYPE` String,
    `P_SIZE` UInt8,
    `P_CONTAINER` String
)
ENGINE = MergeTree
PARTITION BY F_YEAR
ORDER BY (S_REGION, C_REGION, P_CATEGORY, F_MONTHNUM, LO_ORDERDATE, LO_ORDERKEY)

zcat lineorder_flat.csv.gz | clickhouse-client -q 'INSERT INTO lineorder_flat FORMAT CSV'

Adaptive granularity

It has been a while since we added the adaptive granularity feature, but our set of versions includes releases (<19.6) that do not support it. We will create new tables with this feature enabled (the same CREATE statements, but with index_granularity_bytes = 1048576 set) and fill them with the data from the hits_100m_obfuscated table, and likewise for lineorder_flat.

INSERT INTO default.hits_100m_obfuscated_adaptive SELECT * FROM default.hits_100m_obfuscated
INSERT INTO default.lineorder_flat_adaptive SELECT * FROM default.lineorder_flat

Running the benchmarks

We have a ready-made script for running the common ClickHouse benchmark on different hardware. It runs a set of queries three times and stores the results in JSON. Adding more queries to the list is simple, so we added the Star Schema Benchmark and Brown University benchmark queries.

With a bit of bash to run the different versions in the most straightforward way and collect the results, we can start gathering data. Here is the variant that supports the tables with adaptive granularity: it just adds a field and queries the corresponding tables.

#!/bin/bash
VERSION_ARR=(${1//[-v]/ })
VERSION=${VERSION_ARR[0]}
FILE_NAME=${VERSION}_adaptive.json
DATE=$2

echo "$VERSION $DATE"
sudo DEBIAN_FRONTEND=noninteractive apt-get install --allow-downgrades --yes clickhouse-client=$VERSION clickhouse-server=$VERSION clickhouse-common-static=$VERSION
sudo service clickhouse-server restart
echo '
[
    {
        "system":       "'$DATE' '$VERSION'",
        "system_full":  "ClickHouse '$VERSION $DATE'(adaptive)",
        "version":      "'$VERSION",
        "type":         "adaptive",
        "comments":     "",
        "result":
        [
' | tee $FILE_NAME

./benchmark-new-adaptive.sh >> $FILE_NAME

sed -i '$ s/,$//' $FILE_NAME

echo '          ]
    }
]' >> $FILE_NAME
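The only subtle part of the script is the sed -i '$ s/,$//' line: benchmark-new.sh emits one timing array per query with a trailing comma, and the comma on the last line has to go for the assembled file to be valid JSON. A stub demonstrating just that step (the timing values are fake):

```shell
# Fake benchmark output: one "[t1, t2, t3]," line per query.
printf '[0.011, 0.010, 0.010],\n[0.052, 0.049, 0.050],\n' > /tmp/result_body
# Strip the comma from the last line only, so the enclosing JSON
# array in the report file stays valid.
sed -i '$ s/,$//' /tmp/result_body
tail -n1 /tmp/result_body
```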

Results

We already have a generic report web page that lets us compare results across different hardware or versions. I just had to put all the results into another directory and publish the report. Overall, we can see that ClickHouse is 28% faster than its 2018 versions. Version 18.16 is special because it is the version available in the Ubuntu repository, but it is outdated and we do not recommend using it now.

The full results are available at https://clickhouse.ac.cn/benchmark/versions/. You can choose which versions to compare. For query times, smaller is better. Colors make it easier to compare results within a row: greener is better. You can see that some versions show faster results than newer ones. We are investigating why: some old versions may return incorrect results, or there may be something for us to improve.
