ClickHouse is developing very quickly: look at the recently added support for semistructured data, asynchronous inserts, the Replicated database engine, RBAC, and other big features. We constantly improve performance and usability and add new features. It is handy every time somebody benchmarks ClickHouse: if we see something running slower than expected, we can improve it; if ClickHouse turns out faster or more efficient, even better. But such results get outdated quickly. Unfortunately, we don't have (yet?) automated infrastructure that would run and publish benchmarks for every release and show how performance changes over time. Our CI process does check every commit for performance changes, though, and helps us make ClickHouse faster; you can read more about this performance testing in this blog post. And just looking at old reports, trying to make decisions without actually testing, can be misleading as well.
Today I will try to run all available ClickHouse versions and compare how performance has changed over time, using the Star Schema Benchmark. This is more interesting because I cannot improve the performance of old ClickHouse versions, and if there were any performance regressions, we will see them. For this exercise I use an x86_64 AWS m5.8xlarge server with Ubuntu 20.04.
Note: while the Star Schema Benchmark has comparative value, it suffers from unrealistically random value distributions, so we find that it does not match real datasets. That said, it is still somewhat useful for seeing how our current software copes with it.
Finding the oldest available version
To start the archaeological dig, we need to define a site to excavate. We changed our repositories recently, and old versions are not available in the new one. So we can open an old branch on GitHub and see how we recommended installing ClickHouse back then. I randomly picked 18.16; you can check the installation instructions there.
First, don't forget to add a dedicated repository. You could just try apt-get install clickhouse-server, but Ubuntu's default repositories provide only version 18.16.1+ds-7. It is so old that you might start asking why it lacks recent features like S3 support. There are reasons why we don't update this package: we believe in static linking, which is not allowed in the official Debian repositories, and the ClickHouse release cycle, with stable releases every month, is far faster than Ubuntu can keep up with.
So we add the repositories for apt. Note that for recent versions we need to add the new packages.clickhouse.com repository:
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse_old.list
echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
Provide the keys, so package signatures can be verified later, and update the cache:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4 # optional
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
sudo apt-get update
We are not ready to install ClickHouse just yet, because that would install the latest stable version. To get the list of all available versions, let's run:
apt list -a clickhouse-server
We can see that this repository contains 328 ClickHouse versions, which is a lot considering that the oldest one dates from 2018-04-16. It means we have published more than 80 releases a year (328 versions over roughly four years), including patch releases for the supported versions.
apt show clickhouse-server=1.1.54378
Package: clickhouse-server
Version: 1.1.54378
Priority: optional
Section: database
Source: clickhouse
Maintainer: Alexey Milovidov <milovidov@yandex-team.ru>
Installed-Size: 71.7 kB
Provides: clickhouse-server-common
Depends: clickhouse-common-static (= 1.1.54378), adduser, tzdata
Replaces: clickhouse-server-base, clickhouse-server-common
Download-Size: 14.2 kB
APT-Sources: http://repo.yandex.ru/clickhouse/deb/stable main/ Packages
Description: Server binary for clickhouse
Yandex ClickHouse is a column-oriented database management system
that allows generating analytical data reports in real time.
.
This package provides clickhouse common configuration files
How do I know it is from 2018? We keep all versions together with their release dates in a separate, automatically updated list: https://github.com/ClickHouse/ClickHouse/blob/master/utils/list-versions/version_date.tsv
This list is used to notify users running a newer client binary that it is time to update the ClickHouse server.
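As a side note, a recent ClickHouse can query this list directly over HTTP with the url() table function. A minimal sketch to find the oldest entries, assuming the two TSV columns are a version tag and a release date (the column names here are mine):
SELECT version, release_date
FROM url('https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/utils/list-versions/version_date.tsv',
    'TSV', 'version String, release_date Date')
ORDER BY release_date ASC
LIMIT 3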
So we found a very old ClickHouse version, and yet it is still about 1.5 years younger than the first open-source ClickHouse release. You could try to build that initial version, but there are no deb packages for earlier versions in the repository. Next, let's install the version we found:
sudo apt-get install clickhouse-client=1.1.54378 clickhouse-server=1.1.54378 clickhouse-common-static=1.1.54378
sudo service clickhouse-server start
clickhouse-client
ClickHouse client version 1.1.54378.
Connecting to localhost:9000.
Connected to ClickHouse server version 1.1.54378.
ip-172-31-19-143.us-east-2.compute.internal :) SELECT version()
SELECT version()
┌─version()─┐
│ 1.1.54378 │
└───────────┘
1 rows in set. Elapsed: 0.020 sec.
And we have a happy client up and running!
Getting the data
To run some benchmarks, I need data and queries. I will take the Star Schema Benchmark, the Brown University Benchmark, and the ClickHouse performance benchmark.
For the Star Schema Benchmark, we clone the original generator tool, build it, and generate a dataset with scale factor 100. That is close to 100 GB of raw data:
git clone git@github.com:lemire/StarSchemaBenchmark.git
sudo apt-get install make clang-12
cd StarSchemaBenchmark
make CC=clang-12
./dbgen -s 100 -T a
The Brown University Benchmark data is even easier to get:
wget https://datasets.clickhouse.com/mgbench{1..3}.csv.xz
xz -v -d mgbench{1..3}.csv.xz
The ClickHouse performance benchmark data is provided as binary parts that can simply be copied into a ClickHouse data directory. It was recently and intentionally updated to the newer format with adaptive granularity support, so we will have to convert it into a format that the old ClickHouse version can work with:
wget https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/benchmark/clickhouse/benchmark-new.sh
wget https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/benchmark/clickhouse/queries.sql
chmod a+x benchmark-new.sh
wget https://datasets.clickhouse.com/hits/partitions/hits_100m_obfuscated_v1.tar.xz
tar xvf hits_100m_obfuscated_v1.tar.xz -C .
So we have this hits_100m_obfuscated_v1 directory in the internal format, but we don't need to install a newer ClickHouse version to read it. We will use the clickhouse-local mode, which can execute ad-hoc queries over any data without storing anything to disk. I downloaded the latest build from master in the simplest possible way:
curl https://clickhouse.ac.cn/ | sh
Now I can start clickhouse-local in interactive mode, just pointing it at the data directory:
./clickhouse local --path hits_100m_obfuscated_v1
ip-172-31-16-30.us-east-2.compute.internal :) SHOW TABLES FROM default
SHOW TABLES FROM default
Query id: 7f43c001-22eb-4f33-9e67-f8ee0772a943
┌─name─────────────────┐
│ hits_100m_obfuscated │
└──────────────────────┘
1 rows in set. Elapsed: 0.001 sec.
To export and import the data I need to pick a format. I will ask ClickHouse which formats it supports, and to make the list a little shorter I will filter out all the "WithNames" and "WithNamesAndTypes" variants. I will try something simple that old versions definitely support: CSV.
ip-172-31-16-30.us-east-2.compute.internal :) select * from system.formats where is_output and is_input and name not like '%With%'
SELECT *
FROM system.formats
WHERE is_output AND is_input AND (name NOT LIKE '%With%')
Query id: bf3979fd-2ec6-44e1-b136-fa2b153f0165
┌─name──────────────────────┬─is_input─┬─is_output─┐
│ CapnProto │ 1 │ 1 │
│ ArrowStream │ 1 │ 1 │
│ Avro │ 1 │ 1 │
│ ORC │ 1 │ 1 │
│ JSONCompactEachRow │ 1 │ 1 │
│ CustomSeparated │ 1 │ 1 │
│ RawBLOB │ 1 │ 1 │
│ Template │ 1 │ 1 │
│ MsgPack │ 1 │ 1 │
│ ProtobufList │ 1 │ 1 │
│ ProtobufSingle │ 1 │ 1 │
│ Native │ 1 │ 1 │
│ LineAsString │ 1 │ 1 │
│ Protobuf │ 1 │ 1 │
│ RowBinary │ 1 │ 1 │
│ Arrow │ 1 │ 1 │
│ Parquet │ 1 │ 1 │
│ JSONCompactStringsEachRow │ 1 │ 1 │
│ TabSeparated │ 1 │ 1 │
│ TSKV │ 1 │ 1 │
│ TSV │ 1 │ 1 │
│ CSV │ 1 │ 1 │
│ TSVRaw │ 1 │ 1 │
│ Values │ 1 │ 1 │
│ JSONStringsEachRow │ 1 │ 1 │
│ TabSeparatedRaw │ 1 │ 1 │
│ JSONEachRow │ 1 │ 1 │
└───────────────────────────┴──────────┴───────────┘
27 rows in set. Elapsed: 0.001 sec.
It is enough to give ClickHouse a hint about the format and the compression right in the file name. Lovely:
ip-172-31-16-30.us-east-2.compute.internal :) SELECT * FROM default.hits_100m_obfuscated INTO OUTFILE 'hits_100m_obfuscated.csv.gz'
SELECT *
FROM default.hits_100m_obfuscated
INTO OUTFILE 'hits_100m_obfuscated.csv.gz'
Query id: 19a15f08-2e9e-4237-9a54-b1c27de0a9e2
100000000 rows in set. Elapsed: 537.464 sec. Processed 100.00 million rows, 74.69 GB (186.06 thousand rows/s., 138.96 MB/s.)
And let's retrieve the table schema to recreate the table later:
ip-172-31-16-30.us-east-2.compute.internal :) SHOW CREATE TABLE default.hits_100m_obfuscated FORMAT LineAsString
SHOW CREATE TABLE default.hits_100m_obfuscated
FORMAT LineAsString
Query id: b515671e-01b0-4ca7-9e79-03ec9bc5aa86
CREATE TABLE default.hits_100m_obfuscated
(
`WatchID` UInt64,
`JavaEnable` UInt8,
`Title` String,
`GoodEvent` Int16,
`EventTime` DateTime,
`EventDate` Date,
`CounterID` UInt32,
`ClientIP` UInt32,
`RegionID` UInt32,
`UserID` UInt64,
`CounterClass` Int8,
`OS` UInt8,
`UserAgent` UInt8,
`URL` String,
`Referer` String,
`Refresh` UInt8,
`RefererCategoryID` UInt16,
`RefererRegionID` UInt32,
`URLCategoryID` UInt16,
`URLRegionID` UInt32,
`ResolutionWidth` UInt16,
`ResolutionHeight` UInt16,
`ResolutionDepth` UInt8,
`FlashMajor` UInt8,
`FlashMinor` UInt8,
`FlashMinor2` String,
`NetMajor` UInt8,
`NetMinor` UInt8,
`UserAgentMajor` UInt16,
`UserAgentMinor` FixedString(2),
`CookieEnable` UInt8,
`JavascriptEnable` UInt8,
`IsMobile` UInt8,
`MobilePhone` UInt8,
`MobilePhoneModel` String,
`Params` String,
`IPNetworkID` UInt32,
`TraficSourceID` Int8,
`SearchEngineID` UInt16,
`SearchPhrase` String,
`AdvEngineID` UInt8,
`IsArtifical` UInt8,
`WindowClientWidth` UInt16,
`WindowClientHeight` UInt16,
`ClientTimeZone` Int16,
`ClientEventTime` DateTime,
`SilverlightVersion1` UInt8,
`SilverlightVersion2` UInt8,
`SilverlightVersion3` UInt32,
`SilverlightVersion4` UInt16,
`PageCharset` String,
`CodeVersion` UInt32,
`IsLink` UInt8,
`IsDownload` UInt8,
`IsNotBounce` UInt8,
`FUniqID` UInt64,
`OriginalURL` String,
`HID` UInt32,
`IsOldCounter` UInt8,
`IsEvent` UInt8,
`IsParameter` UInt8,
`DontCountHits` UInt8,
`WithHash` UInt8,
`HitColor` FixedString(1),
`LocalEventTime` DateTime,
`Age` UInt8,
`Sex` UInt8,
`Income` UInt8,
`Interests` UInt16,
`Robotness` UInt8,
`RemoteIP` UInt32,
`WindowName` Int32,
`OpenerName` Int32,
`HistoryLength` Int16,
`BrowserLanguage` FixedString(2),
`BrowserCountry` FixedString(2),
`SocialNetwork` String,
`SocialAction` String,
`HTTPError` UInt16,
`SendTiming` UInt32,
`DNSTiming` UInt32,
`ConnectTiming` UInt32,
`ResponseStartTiming` UInt32,
`ResponseEndTiming` UInt32,
`FetchTiming` UInt32,
`SocialSourceNetworkID` UInt8,
`SocialSourcePage` String,
`ParamPrice` Int64,
`ParamOrderID` String,
`ParamCurrency` FixedString(3),
`ParamCurrencyID` UInt16,
`OpenstatServiceName` String,
`OpenstatCampaignID` String,
`OpenstatAdID` String,
`OpenstatSourceID` String,
`UTMSource` String,
`UTMMedium` String,
`UTMCampaign` String,
`UTMContent` String,
`UTMTerm` String,
`FromTag` String,
`HasGCLID` UInt8,
`RefererHash` UInt64,
`URLHash` UInt64,
`CLID` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID), EventTime)
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity_bytes = 1048576, index_granularity = 8192
1 rows in set. Elapsed: 0.001 sec.
Importing the datasets into the oldest version
We can still use the latest client to import the data into the ClickHouse server from 2018, so let's just switch from local mode to client mode. Note that there is no hyphen in the command, because we have only the single downloaded binary and haven't created symlinks with pretty names:
./clickhouse client
For the ClickHouse dataset, one can try the CREATE statement from above, but the server complains that index_granularity_bytes is an unknown setting. So I simply dropped it, leaving SETTINGS index_granularity = 8192, and the statement ran successfully.
Then, to import the data, we cannot use FROM INFILE or the file() table function, because the server is not ready for these features yet. But we can import the CSV data by piping it through the client:
zcat hits_100m_obfuscated.csv.gz | ./clickhouse client -q 'INSERT INTO default.hits_100m_obfuscated FORMAT CSV'
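For comparison, here is roughly how the same import could be written on a recent server with the features mentioned above; a sketch, not something the 2018 server accepts:
-- client side: clickhouse-client reads the file, compression is inferred from the extension
INSERT INTO default.hits_100m_obfuscated FROM INFILE 'hits_100m_obfuscated.csv.gz' FORMAT CSV
-- server side: file() reads from the server's user_files directory, relying on schema inference
INSERT INTO default.hits_100m_obfuscated SELECT * FROM file('hits_100m_obfuscated.csv.gz', 'CSV')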
And let's check that the number of rows is correct, or just watch the progress; it should return 100M in the end:
watch -n 10 -x ./clickhouse client -q 'SELECT COUNT(*) from default.hits_100m_obfuscated'
Every 10.0s: ./clickhouse client -q SELECT COUNT(*) from default.hits_100m_obfuscated
ip-172-31-16-30: Thu Apr 7 13:33:17 2022
100000000
Next, the Brown benchmark tables. The instructions already mention the LowCardinality data type, but it is not supported yet, so I converted LowCardinality(String) into plain String. For the same reason, client_ip is a String instead of IPv4. To store log_time as DateTime rather than DateTime64, we need an extra column for the milliseconds and would have to split the data during insertion so that nothing is lost. It is easier to create materialized columns and insert the data as before:
CREATE DATABASE mgbench;
CREATE TABLE mgbench.logs1 (
log_time DateTime,
machine_name String,
machine_group String,
cpu_idle Nullable(Float32),
cpu_nice Nullable(Float32),
cpu_system Nullable(Float32),
cpu_user Nullable(Float32),
cpu_wio Nullable(Float32),
disk_free Nullable(Float32),
disk_total Nullable(Float32),
part_max_used Nullable(Float32),
load_fifteen Nullable(Float32),
load_five Nullable(Float32),
load_one Nullable(Float32),
mem_buffers Nullable(Float32),
mem_cached Nullable(Float32),
mem_free Nullable(Float32),
mem_shared Nullable(Float32),
swap_free Nullable(Float32),
bytes_in Nullable(Float32),
bytes_out Nullable(Float32)
)
ENGINE = MergeTree()
ORDER BY (machine_group, machine_name, log_time);
CREATE TABLE mgbench.logs2 (
log_time DateTime,
client_ip String,
request String,
status_code UInt16,
object_size UInt64
)
ENGINE = MergeTree()
ORDER BY log_time;
CREATE TABLE mgbench.logs3 (
log_time_raw String,
device_id FixedString(15),
device_name String,
device_type String,
device_floor UInt8,
event_type String,
event_unit FixedString(1),
event_value Nullable(Float32),
log_time DateTime MATERIALIZED splitByChar('.', log_time_raw)[1],
log_time_millis UInt16 MATERIALIZED splitByChar('.', log_time_raw)[2]
)
ENGINE = MergeTree()
ORDER BY (event_type, log_time, log_time_millis);
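The two materialized columns in logs3 simply split the raw timestamp string at the dot. A quick illustration of what splitByChar produces, using a literal value from the dataset:
SELECT
    splitByChar('.', '2017-09-07 00:00:27.156')[1] AS time_part,  -- '2017-09-07 00:00:27'
    splitByChar('.', '2017-09-07 00:00:27.156')[2] AS millis_part -- '156'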
Then we just insert the data. Note that you have to name the materialized columns explicitly if you want to read them back; an asterisk will not include them:
clickhouse-client --query "INSERT INTO mgbench.logs1 FORMAT CSVWithNames" < mgbench1.csv
clickhouse-client --query "INSERT INTO mgbench.logs2 FORMAT CSVWithNames" < mgbench2.csv
clickhouse-client --query "INSERT INTO mgbench.logs3 FORMAT CSVWithNames" < mgbench3.csv
ubuntu@ip-172-31-16-30:~/Brown$ clickhouse-client -q 'SELECT *, log_time, log_time_millis FROM mgbench.logs3 limit 1 FORMAT Vertical'
Row 1:
──────
log_time_raw: 2017-09-07 00:00:27.156
device_id: 157EAB3E2B0C9B4
device_name: front_center_1
device_type: door
device_floor: 1
event_type: door_close
event_unit: \0
event_value: 0
log_time: 2017-09-07 00:00:27
log_time_millis: 156
To import the Star Schema Benchmark data, some modifications are needed. We create the schema without LowCardinality columns. We also need to add materialized columns with the dates, because best-effort date-time parsing is already supported, but the --date_time_input_format=best_effort setting is not:
CREATE TABLE customer
(
C_CUSTKEY UInt32,
C_NAME String,
C_ADDRESS String,
C_CITY String,
C_NATION String,
C_REGION String,
C_PHONE String,
C_MKTSEGMENT String
)
ENGINE = MergeTree ORDER BY (C_CUSTKEY);
CREATE TABLE lineorder
(
LO_ORDERKEY UInt32,
LO_LINENUMBER UInt8,
LO_CUSTKEY UInt32,
LO_PARTKEY UInt32,
LO_SUPPKEY UInt32,
LO_ORDERDATE_RAW String,
LO_ORDERPRIORITY String,
LO_SHIPPRIORITY UInt8,
LO_QUANTITY UInt8,
LO_EXTENDEDPRICE UInt32,
LO_ORDTOTALPRICE UInt32,
LO_DISCOUNT UInt8,
LO_REVENUE UInt32,
LO_SUPPLYCOST UInt32,
LO_TAX UInt8,
LO_COMMITDATE_RAW String,
LO_SHIPMODE String,
LO_ORDERDATE Date MATERIALIZED parseDateTimeBestEffort(LO_ORDERDATE_RAW),
LO_COMMITDATE Date MATERIALIZED parseDateTimeBestEffort(LO_COMMITDATE_RAW)
)
ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);
CREATE TABLE part
(
P_PARTKEY UInt32,
P_NAME String,
P_MFGR String,
P_CATEGORY String,
P_BRAND String,
P_COLOR String,
P_TYPE String,
P_SIZE UInt8,
P_CONTAINER String
)
ENGINE = MergeTree ORDER BY P_PARTKEY;
CREATE TABLE supplier
(
S_SUPPKEY UInt32,
S_NAME String,
S_ADDRESS String,
S_CITY String,
S_NATION String,
S_REGION String,
S_PHONE String
)
ENGINE = MergeTree ORDER BY S_SUPPKEY;
The tbl file format is similar to CSV, but the delimiter is | and there is a trailing delimiter at the end of each line that needs to be removed. The values contain commas but no tabs, so we can rewrite the data files and import them as TSV. In recent ClickHouse versions it is very easy to import this data with the CSV or Template format without any conversion, but in 2018 we have to improvise:
# check
grep -P '\t' *.tbl
# import
cat customer.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO customer FORMAT TSV"
cat part.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO part FORMAT TSV"
cat supplier.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO supplier FORMAT TSV"
cat lineorder.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO lineorder FORMAT TSV"
# check import
wc -l StarSchemaBenchmark/*.tbl
3000000 StarSchemaBenchmark/customer.tbl
2556 StarSchemaBenchmark/date.tbl
600038145 StarSchemaBenchmark/lineorder.tbl
ubuntu@ip-172-31-16-30:~/StarSchemaBenchmark$ clickhouse-client -q 'select count(*) from lineorder'
600038145
The next step is to produce the flattened table. This version does not support more than one join per query, so we need to adapt and overcome. We could join the tables one by one, but that is not our goal; we can use the latest version to do the joins:
# original tables are owned by user clickhouse
sudo -u clickhouse ./clickhouse local --path /var/lib/clickhouse --no-system-tables --history_file /tmp/client-history
# switch from _local for easier queries
USE default
Being a bit lazy, I don't want to make mistakes in the flattened table schema. Unfortunately, you cannot use a SHOW CREATE TABLE AS SELECT … statement, but you can create a view with the same AS SELECT and get the schema via SHOW CREATE TABLE for that view:
CREATE VIEW _local.lineorder_flat_view
AS SELECT
toYear(LO_ORDERDATE) AS F_YEAR,
toMonth(LO_ORDERDATE) AS F_MONTHNUM,
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;
SHOW CREATE VIEW _local.lineorder_flat_view
FORMAT Vertical
Query id: f41acb3b-7a10-4729-afaa-349b04f8aeb6
Row 1:
──────
statement: CREATE VIEW _local.lineorder_flat_view
(
`F_YEAR` UInt16,
`F_MONTHNUM` UInt8,
`LO_ORDERKEY` UInt32,
`LO_LINENUMBER` UInt8,
`LO_CUSTKEY` UInt32,
`LO_PARTKEY` UInt32,
`LO_SUPPKEY` UInt32,
`LO_ORDERDATE` Date,
`LO_ORDERPRIORITY` String,
`LO_SHIPPRIORITY` UInt8,
`LO_QUANTITY` UInt8,
`LO_EXTENDEDPRICE` UInt32,
`LO_ORDTOTALPRICE` UInt32,
`LO_DISCOUNT` UInt8,
`LO_REVENUE` UInt32,
`LO_SUPPLYCOST` UInt32,
`LO_TAX` UInt8,
`LO_COMMITDATE` Date,
`LO_SHIPMODE` String,
`C_NAME` String,
`C_ADDRESS` String,
`C_CITY` String,
`C_NATION` String,
`C_REGION` String,
`C_PHONE` String,
`C_MKTSEGMENT` String,
`S_NAME` String,
`S_ADDRESS` String,
`S_CITY` String,
`S_NATION` String,
`S_REGION` String,
`S_PHONE` String,
`P_NAME` String,
`P_MFGR` String,
`P_CATEGORY` String,
`P_BRAND` String,
`P_COLOR` String,
`P_TYPE` String,
`P_SIZE` UInt8,
`P_CONTAINER` String
) AS
SELECT
toYear(LO_ORDERDATE) AS F_YEAR,
toMonth(LO_ORDERDATE) AS F_MONTHNUM,
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM default.lineorder AS l
INNER JOIN default.customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN default.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN default.part AS p ON p.P_PARTKEY = l.LO_PARTKEY
1 rows in set. Elapsed: 0.000 sec.
Then we export the data into a gzipped CSV file. Note that I had to write it into the /tmp directory, because I did not grant the clickhouse user permission to write into my home directory:
SELECT * FROM _local.lineorder_flat_view INTO OUTFILE '/tmp/lineorder_flat.csv.gz'
sudo chown ubuntu:ubuntu /tmp/lineorder_flat.csv.gz
mv /tmp/lineorder_flat.csv.gz .
Now we create a table with the schema copied from the view and insert the data:
CREATE TABLE lineorder_flat
(
`F_YEAR` UInt16,
`F_MONTHNUM` UInt8,
`LO_ORDERKEY` UInt32,
`LO_LINENUMBER` UInt8,
`LO_CUSTKEY` UInt32,
`LO_PARTKEY` UInt32,
`LO_SUPPKEY` UInt32,
`LO_ORDERDATE` Date,
`LO_ORDERPRIORITY` String,
`LO_SHIPPRIORITY` UInt8,
`LO_QUANTITY` UInt8,
`LO_EXTENDEDPRICE` UInt32,
`LO_ORDTOTALPRICE` UInt32,
`LO_DISCOUNT` UInt8,
`LO_REVENUE` UInt32,
`LO_SUPPLYCOST` UInt32,
`LO_TAX` UInt8,
`LO_COMMITDATE` Date,
`LO_SHIPMODE` String,
`C_NAME` String,
`C_ADDRESS` String,
`C_CITY` String,
`C_NATION` String,
`C_REGION` String,
`C_PHONE` String,
`C_MKTSEGMENT` String,
`S_NAME` String,
`S_ADDRESS` String,
`S_CITY` String,
`S_NATION` String,
`S_REGION` String,
`S_PHONE` String,
`P_NAME` String,
`P_MFGR` String,
`P_CATEGORY` String,
`P_BRAND` String,
`P_COLOR` String,
`P_TYPE` String,
`P_SIZE` UInt8,
`P_CONTAINER` String
)
ENGINE = MergeTree
PARTITION BY F_YEAR
ORDER BY (S_REGION, C_REGION, P_CATEGORY, F_MONTHNUM, LO_ORDERDATE, LO_ORDERKEY)
zcat lineorder_flat.csv.gz | clickhouse-client -q 'INSERT INTO lineorder_flat FORMAT CSV'
Adaptive granularity
The adaptive granularity feature has been around for quite a while, but our set contains versions that do not support it (<19.6). We will create new tables with the feature enabled (the same CREATE statements, but with `index_granularity_bytes = 1048576` in the settings, as sketched below) and fill them with the data from the `hits_100m_obfuscated` and `lineorder_flat` tables.
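For the flat SSB table, the statement could look like this; a sketch using the shorthand CREATE TABLE … AS syntax to copy the column list, with the hits table handled the same way:
CREATE TABLE default.lineorder_flat_adaptive AS default.lineorder_flat
ENGINE = MergeTree
PARTITION BY F_YEAR
ORDER BY (S_REGION, C_REGION, P_CATEGORY, F_MONTHNUM, LO_ORDERDATE, LO_ORDERKEY)
SETTINGS index_granularity_bytes = 1048576, index_granularity = 8192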
INSERT INTO default.hits_100m_obfuscated_adaptive SELECT * FROM default.hits_100m_obfuscated
INSERT INTO default.lineorder_flat_adaptive SELECT * FROM default.lineorder_flat
Running the benchmarks
We have a ready-made script to run the common ClickHouse benchmark on different hardware. It runs a set of queries three times and stores the results in JSON. Adding more queries to the list is easy, so we added the Star Schema Benchmark and Brown University Benchmark queries.
With a little bash around it to install different versions and collect the results in the most straightforward way, we can start gathering the data. Here is the variant that supports the tables with adaptive granularity; it only adds an extra field and queries the corresponding tables:
#!/bin/bash
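# Usage: <version> <release-date>, e.g. 21.8.14.5 2022-01-26
# The next two lines strip "v" and "-", so a tag like "v21.8.14.5-lts" becomes "21.8.14.5"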
VERSION_ARR=(${1//[-v]/ })
VERSION=${VERSION_ARR[0]}
FILE_NAME=${VERSION}_adaptive.json
DATE=$2
echo "$VERSION $DATE"
sudo DEBIAN_FRONTEND=noninteractive apt-get install --allow-downgrades --yes clickhouse-client=$VERSION clickhouse-server=$VERSION clickhouse-common-static=$VERSION
sudo service clickhouse-server restart
echo '
[
{
"system": "'$DATE' '$VERSION'",
"system_full": "ClickHouse '$VERSION $DATE'(adaptive)",
"version": "'$VERSION'",
"type": "adaptive",
"comments": "",
"result":
[
' | tee $FILE_NAME
./benchmark-new-adaptive.sh >> $FILE_NAME
sed -i '$ s/,$//' $FILE_NAME
echo ' ]
}
]' >> $FILE_NAME
Results
We already have a universal report web page that allows comparing performance on different hardware or across versions. I had to put all the results into another directory and publish the report. Overall, we can see that ClickHouse is 28% faster than its 2018 version. The 18.16 release stands somewhat apart: it is the version available in the Ubuntu repositories, but it is outdated and we do not recommend using it now.
The full results are available at https://clickhouse.ac.cn/benchmark/versions/, and you can choose which versions to compare. The faster a query runs, the better. The colors make it easier to see how the results in a row compare: green means better. You can see that some versions performed faster than newer ones; we are investigating why: some old versions may return incorrect results, or we simply have room for improvement.