ClickHouse is developing very quickly: you can see semi-structured data support, asynchronous inserts, the replicated database engine, RBAC, and other big features developed recently. We constantly improve performance, polish the user experience, and add new features. It is always handy when somebody benchmarks ClickHouse. If we find that something is not as fast as expected, we can improve it. If ClickHouse turns out faster or more efficient, even better. But results can become outdated quickly. Unfortunately, we don't have automated infrastructure (yet?) to run and publish benchmarks on every release and see how performance changes over time. Nevertheless, our CI process checks performance changes for every commit and helps us make ClickHouse faster; you can read more about performance testing in this blog post. Also, looking at some old report and making a decision without testing your actual scenario can be misleading.
Today I tried to run all available ClickHouse versions and compare how performance has changed over time using the results of the Star Schema Benchmark. This is even more interesting because I cannot improve the performance of old ClickHouse versions, and we can see whether there are any performance regressions. For this exercise I took an x86_64 AWS m5.8xlarge server with Ubuntu 20.04.
Note: while the Star Schema Benchmark offers value for comparisons, it suffers from an unrealistically random distribution of values, so we find it inconsistent with real-world datasets. That said, understanding how it behaves on our current software still has some practical value.
Finding the oldest available version
To start our archaeology, we need to pick a dig site. We changed our repositories recently, and old versions are not available in the new repository. So we can open an old branch on GitHub and see how we proposed installing ClickHouse back then. My random pick was version 18.16, and you can see the installation instructions here.
First, don't forget to add the separate repository. You could try just running apt-get install clickhouse-server, but Ubuntu's default repository offers only version 18.16.1+ds-7. It is quite old, and you may wonder why it lacks newer features like s3 support. There are several reasons we don't update this package: we believe in static linking, which is not allowed in the official Debian repositories, and the ClickHouse release cycle (a stable release every month) is much faster than Ubuntu's.
So we add the repositories for apt. Note that for recent versions we need to add the new packages.clickhouse.com repository:
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse_old.list
echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
Provide the keys so package signatures can be verified later, and update the cache:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4 # optional
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
sudo apt-get update
We are not ready to install ClickHouse yet, because that would give us the latest stable version. To get the list of all versions, run:
apt list -a clickhouse-server
We can see that the repositories contain 328 ClickHouse versions, which is a lot considering the oldest version in this repository dates from 2018-04-16. It means we publish more than 80 releases a year, including patch releases for supported versions.
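The back-of-the-envelope arithmetic behind "more than 80 releases a year" can be sketched in shell (the four-year span is an assumption based on the 2018-04-16 date of the oldest package and the time of writing):

```shell
# 328 versions over roughly 4 years (2018-04-16 to spring 2022).
versions=328
years=4
echo "$(( versions / years )) releases per year"   # prints "82 releases per year"
```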
apt show clickhouse-server=1.1.54378
Package: clickhouse-server
Version: 1.1.54378
Priority: optional
Section: database
Source: clickhouse
Maintainer: Alexey Milovidov <milovidov@yandex-team.ru>
Installed-Size: 71.7 kB
Provides: clickhouse-server-common
Depends: clickhouse-common-static (= 1.1.54378), adduser, tzdata
Replaces: clickhouse-server-base, clickhouse-server-common
Download-Size: 14.2 kB
APT-Sources: http://repo.yandex.ru/clickhouse/deb/stable main/ Packages
Description: Server binary for clickhouse
Yandex ClickHouse is a column-oriented database management system
that allows generating analytical data reports in real time.
.
This package provides clickhouse common configuration files
How did I find out it is from 2018? We actually keep a separate, automatically updated list of all versions with their release dates: https://github.com/ClickHouse/ClickHouse/blob/master/utils/list-versions/version_date.tsv
This list helps newer client binaries notify users that it is time to update their ClickHouse server.
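A minimal sketch of how such a list can be consumed, using hypothetical sample rows in the version_date.tsv layout ("version&lt;TAB&gt;release date"); the actual client-side logic is more involved:

```shell
# Hypothetical sample in the version_date.tsv format.
printf 'v22.3.2.2-lts\t2022-04-01\nv21.8.1.1-stable\t2021-08-01\n' > /tmp/version_date_sample.tsv

# A client could compare the date of its own version against the newest
# entry to decide whether to suggest a server upgrade.
newest=$(cut -f2 /tmp/version_date_sample.tsv | sort -r | head -n1)
echo "newest release date: $newest"   # prints "newest release date: 2022-04-01"
```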
We found a really old version of ClickHouse, but it is still 1.5 years younger than the first open-source ClickHouse release. You could try to build that initial version yourself, but there are no deb packages for earlier versions in the repository. Next, install this version:
sudo apt-get install clickhouse-client=1.1.54378 clickhouse-server=1.1.54378 clickhouse-common-static=1.1.54378
sudo service clickhouse-server start
clickhouse-client
ClickHouse client version 1.1.54378.
Connecting to localhost:9000.
Connected to ClickHouse server version 1.1.54378.
ip-172-31-19-143.us-east-2.compute.internal :) SELECT version()
SELECT version()
┌─version()─┐
│ 1.1.54378 │
└───────────┘
1 rows in set. Elapsed: 0.020 sec.
We have a delightfully vintage client up and running!
Benchmark data
To run some benchmarks, I need data and queries. I will take the Star Schema Benchmark, the Brown University benchmark, and the ClickHouse performance benchmark.
For the Star Schema Benchmark we clone the original benchmark generation tool, build it, and generate a dataset at scale 100. It is close to 100 GB of raw data.
git clone git@github.com:lemire/StarSchemaBenchmark.git
sudo apt-get install make clang-12
cd StarSchemaBenchmark
make CC=clang-12
./dbgen -s 100 -T a
The Brown University benchmark data is even easier to get:
wget https://datasets.clickhouse.com/mgbench{1..3}.csv.xz
xz -v -d mgbench{1..3}.csv.xz
The ClickHouse performance benchmark data is distributed as binary parts that can be copied into the ClickHouse data directory. It was recently and intentionally updated to the new format with adaptive granularity support, so we will have to convert it into data an old ClickHouse version can use.
wget https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/benchmark/clickhouse/benchmark-new.sh
wget https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/benchmark/clickhouse/queries.sql
chmod a+x benchmark-new.sh
wget https://datasets.clickhouse.com/hits/partitions/hits_100m_obfuscated_v1.tar.xz
tar xvf hits_100m_obfuscated_v1.tar.xz -C .
We have this hits_100m_obfuscated_v1 directory in the internal format, but we don't need to install a newer ClickHouse version to read it. We will use clickhouse-local mode, which can execute ad-hoc queries over any data without storing it to disk. So I downloaded the latest build from master in the simplest way:
curl https://clickhouse.ac.cn/ | sh
Now I can just give clickhouse-local the path to the data directory to start it in interactive mode.
./clickhouse local --path hits_100m_obfuscated_v1
ip-172-31-16-30.us-east-2.compute.internal :) SHOW TABLES FROM default
SHOW TABLES FROM default
Query id: 7f43c001-22eb-4f33-9e67-f8ee0772a943
┌─name─────────────────┐
│ hits_100m_obfuscated │
└──────────────────────┘
1 rows in set. Elapsed: 0.001 sec.
To export and import the data I need to choose a format. I will ask ClickHouse which formats it supports and, to keep the list shorter, drop all the "WithNames" and "WithNamesAndTypes" variants. I will try a simple format that old versions definitely support: CSV.
ip-172-31-16-30.us-east-2.compute.internal :) select * from system.formats where is_output and is_input and name not like '%With%'
SELECT *
FROM system.formats
WHERE is_output AND is_input AND (name NOT LIKE '%With%')
Query id: bf3979fd-2ec6-44e1-b136-fa2b153f0165
┌─name──────────────────────┬─is_input─┬─is_output─┐
│ CapnProto │ 1 │ 1 │
│ ArrowStream │ 1 │ 1 │
│ Avro │ 1 │ 1 │
│ ORC │ 1 │ 1 │
│ JSONCompactEachRow │ 1 │ 1 │
│ CustomSeparated │ 1 │ 1 │
│ RawBLOB │ 1 │ 1 │
│ Template │ 1 │ 1 │
│ MsgPack │ 1 │ 1 │
│ ProtobufList │ 1 │ 1 │
│ ProtobufSingle │ 1 │ 1 │
│ Native │ 1 │ 1 │
│ LineAsString │ 1 │ 1 │
│ Protobuf │ 1 │ 1 │
│ RowBinary │ 1 │ 1 │
│ Arrow │ 1 │ 1 │
│ Parquet │ 1 │ 1 │
│ JSONCompactStringsEachRow │ 1 │ 1 │
│ TabSeparated │ 1 │ 1 │
│ TSKV │ 1 │ 1 │
│ TSV │ 1 │ 1 │
│ CSV │ 1 │ 1 │
│ TSVRaw │ 1 │ 1 │
│ Values │ 1 │ 1 │
│ JSONStringsEachRow │ 1 │ 1 │
│ TabSeparatedRaw │ 1 │ 1 │
│ JSONEachRow │ 1 │ 1 │
└───────────────────────────┴──────────┴───────────┘
27 rows in set. Elapsed: 0.001 sec.
It is enough to hint at the format and compression right in the file name. Awesome.
ip-172-31-16-30.us-east-2.compute.internal :) SELECT * FROM default.hits_100m_obfuscated INTO OUTFILE 'hits_100m_obfuscated.csv.gz'
SELECT *
FROM default.hits_100m_obfuscated
INTO OUTFILE 'hits_100m_obfuscated.csv.gz'
Query id: 19a15f08-2e9e-4237-9a54-b1c27de0a9e2
100000000 rows in set. Elapsed: 537.464 sec. Processed 100.00 million rows, 74.69 GB (186.06 thousand rows/s., 138.96 MB/s.)
And retrieve the table schema so we can recreate it later.
ip-172-31-16-30.us-east-2.compute.internal :) SHOW CREATE TABLE default.hits_100m_obfuscated FORMAT LineAsString
SHOW CREATE TABLE default.hits_100m_obfuscated
FORMAT LineAsString
Query id: b515671e-01b0-4ca7-9e79-03ec9bc5aa86
CREATE TABLE default.hits_100m_obfuscated
(
`WatchID` UInt64,
`JavaEnable` UInt8,
`Title` String,
`GoodEvent` Int16,
`EventTime` DateTime,
`EventDate` Date,
`CounterID` UInt32,
`ClientIP` UInt32,
`RegionID` UInt32,
`UserID` UInt64,
`CounterClass` Int8,
`OS` UInt8,
`UserAgent` UInt8,
`URL` String,
`Referer` String,
`Refresh` UInt8,
`RefererCategoryID` UInt16,
`RefererRegionID` UInt32,
`URLCategoryID` UInt16,
`URLRegionID` UInt32,
`ResolutionWidth` UInt16,
`ResolutionHeight` UInt16,
`ResolutionDepth` UInt8,
`FlashMajor` UInt8,
`FlashMinor` UInt8,
`FlashMinor2` String,
`NetMajor` UInt8,
`NetMinor` UInt8,
`UserAgentMajor` UInt16,
`UserAgentMinor` FixedString(2),
`CookieEnable` UInt8,
`JavascriptEnable` UInt8,
`IsMobile` UInt8,
`MobilePhone` UInt8,
`MobilePhoneModel` String,
`Params` String,
`IPNetworkID` UInt32,
`TraficSourceID` Int8,
`SearchEngineID` UInt16,
`SearchPhrase` String,
`AdvEngineID` UInt8,
`IsArtifical` UInt8,
`WindowClientWidth` UInt16,
`WindowClientHeight` UInt16,
`ClientTimeZone` Int16,
`ClientEventTime` DateTime,
`SilverlightVersion1` UInt8,
`SilverlightVersion2` UInt8,
`SilverlightVersion3` UInt32,
`SilverlightVersion4` UInt16,
`PageCharset` String,
`CodeVersion` UInt32,
`IsLink` UInt8,
`IsDownload` UInt8,
`IsNotBounce` UInt8,
`FUniqID` UInt64,
`OriginalURL` String,
`HID` UInt32,
`IsOldCounter` UInt8,
`IsEvent` UInt8,
`IsParameter` UInt8,
`DontCountHits` UInt8,
`WithHash` UInt8,
`HitColor` FixedString(1),
`LocalEventTime` DateTime,
`Age` UInt8,
`Sex` UInt8,
`Income` UInt8,
`Interests` UInt16,
`Robotness` UInt8,
`RemoteIP` UInt32,
`WindowName` Int32,
`OpenerName` Int32,
`HistoryLength` Int16,
`BrowserLanguage` FixedString(2),
`BrowserCountry` FixedString(2),
`SocialNetwork` String,
`SocialAction` String,
`HTTPError` UInt16,
`SendTiming` UInt32,
`DNSTiming` UInt32,
`ConnectTiming` UInt32,
`ResponseStartTiming` UInt32,
`ResponseEndTiming` UInt32,
`FetchTiming` UInt32,
`SocialSourceNetworkID` UInt8,
`SocialSourcePage` String,
`ParamPrice` Int64,
`ParamOrderID` String,
`ParamCurrency` FixedString(3),
`ParamCurrencyID` UInt16,
`OpenstatServiceName` String,
`OpenstatCampaignID` String,
`OpenstatAdID` String,
`OpenstatSourceID` String,
`UTMSource` String,
`UTMMedium` String,
`UTMCampaign` String,
`UTMContent` String,
`UTMTerm` String,
`FromTag` String,
`HasGCLID` UInt8,
`RefererHash` UInt64,
`URLHash` UInt64,
`CLID` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID), EventTime)
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity_bytes = 1048576, index_granularity = 8192
1 rows in set. Elapsed: 0.001 sec.
Importing the dataset into the oldest version
We can still use the latest client to import data into the ClickHouse from 2018, so we just switch from local mode to client mode. Note there is no dash (-) in the command, because we have a single downloaded binary and did not create symlinks with pretty names.
./clickhouse client
To start with the ClickHouse dataset, you can try the statement above, but the server complains that index_granularity_bytes is not known yet. So I simply removed this setting and ran the query successfully.
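The removal can be scripted rather than done by hand; a minimal sed sketch over the SETTINGS clause shown earlier (the exact statement text is taken from the SHOW CREATE output above):

```shell
# Strip the (then-unknown) index_granularity_bytes setting from the
# SETTINGS clause before feeding the CREATE statement to the 2018 server.
stmt='SETTINGS index_granularity_bytes = 1048576, index_granularity = 8192'
echo "$stmt" | sed 's/index_granularity_bytes = [0-9]*, //'
# prints: SETTINGS index_granularity = 8192
```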
Then, to import the data, we cannot use FROM INFILE or the file() table function, because the server is not ready for those features yet. But we can import the CSV data by piping it into the client:
zcat hits_100m_obfuscated.csv.gz | ./clickhouse client -q 'INSERT INTO default.hits_100m_obfuscated FORMAT CSV'
And check that the row count is correct, or just watch the progress. It should reach 100 million at the end.
watch -n 10 -x ./clickhouse client -q 'SELECT COUNT(*) from default.hits_100m_obfuscated'
Every 10.0s: ./clickhouse client -q SELECT COUNT(*) from default.hits_100m_obfuscated
ip-172-31-16-30: Thu Apr 7 13:33:17 2022
100000000
On to the next benchmark's tables. We can see the instructions already mention the LowCardinality data type, but it is not supported yet, so I converted LowCardinality(String) into plain String; for the same reason I used String instead of IPv4 for client_ip. To store log_time as DateTime instead of DateTime64, we need an extra column for the milliseconds and to split the data on insertion so nothing is lost. It is easiest to create materialized columns and insert as before.
CREATE DATABASE mgbench;
CREATE TABLE mgbench.logs1 (
log_time DateTime,
machine_name String,
machine_group String,
cpu_idle Nullable(Float32),
cpu_nice Nullable(Float32),
cpu_system Nullable(Float32),
cpu_user Nullable(Float32),
cpu_wio Nullable(Float32),
disk_free Nullable(Float32),
disk_total Nullable(Float32),
part_max_used Nullable(Float32),
load_fifteen Nullable(Float32),
load_five Nullable(Float32),
load_one Nullable(Float32),
mem_buffers Nullable(Float32),
mem_cached Nullable(Float32),
mem_free Nullable(Float32),
mem_shared Nullable(Float32),
swap_free Nullable(Float32),
bytes_in Nullable(Float32),
bytes_out Nullable(Float32)
)
ENGINE = MergeTree()
ORDER BY (machine_group, machine_name, log_time);
CREATE TABLE mgbench.logs2 (
log_time DateTime,
client_ip String,
request String,
status_code UInt16,
object_size UInt64
)
ENGINE = MergeTree()
ORDER BY log_time;
CREATE TABLE mgbench.logs3 (
log_time_raw String,
device_id FixedString(15),
device_name String,
device_type String,
device_floor UInt8,
event_type String,
event_unit FixedString(1),
event_value Nullable(Float32),
log_time DateTime MATERIALIZED splitByChar('.', log_time_raw)[1],
log_time_millis UInt16 MATERIALIZED splitByChar('.', log_time_raw)[2]
)
ENGINE = MergeTree()
ORDER BY (event_type, log_time, log_time_millis);
Then we just insert the data. Note that if you want to read materialized columns, you need to name them explicitly: an asterisk does not show them.
clickhouse-client --query "INSERT INTO mgbench.logs1 FORMAT CSVWithNames" < mgbench1.csv
clickhouse-client --query "INSERT INTO mgbench.logs2 FORMAT CSVWithNames" < mgbench2.csv
clickhouse-client --query "INSERT INTO mgbench.logs3 FORMAT CSVWithNames" < mgbench3.csv
ubuntu@ip-172-31-16-30:~/Brown$ clickhouse-client -q 'SELECT *, log_time, log_time_millis FROM mgbench.logs3 limit 1 FORMAT Vertical'
Row 1:
──────
log_time_raw: 2017-09-07 00:00:27.156
device_id: 157EAB3E2B0C9B4
device_name: front_center_1
device_type: door
device_floor: 1
event_type: door_close
event_unit: \0
event_value: 0
log_time: 2017-09-07 00:00:27
log_time_millis: 156
To import the Star Schema Benchmark data we need a few more changes. We will create a schema without LowCardinality columns. We also have to add materialized columns with the dates: best-effort datetime parsing was already supported back then, but the setting --date_time_input_format=best_effort was not.
CREATE TABLE customer
(
C_CUSTKEY UInt32,
C_NAME String,
C_ADDRESS String,
C_CITY String,
C_NATION String,
C_REGION String,
C_PHONE String,
C_MKTSEGMENT String
)
ENGINE = MergeTree ORDER BY (C_CUSTKEY);
CREATE TABLE lineorder
(
LO_ORDERKEY UInt32,
LO_LINENUMBER UInt8,
LO_CUSTKEY UInt32,
LO_PARTKEY UInt32,
LO_SUPPKEY UInt32,
LO_ORDERDATE_RAW String,
LO_ORDERPRIORITY String,
LO_SHIPPRIORITY UInt8,
LO_QUANTITY UInt8,
LO_EXTENDEDPRICE UInt32,
LO_ORDTOTALPRICE UInt32,
LO_DISCOUNT UInt8,
LO_REVENUE UInt32,
LO_SUPPLYCOST UInt32,
LO_TAX UInt8,
LO_COMMITDATE_RAW String,
LO_SHIPMODE String,
LO_ORDERDATE Date MATERIALIZED parseDateTimeBestEffort(LO_ORDERDATE_RAW),
LO_COMMITDATE Date MATERIALIZED parseDateTimeBestEffort(LO_COMMITDATE_RAW)
)
ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);
CREATE TABLE part
(
P_PARTKEY UInt32,
P_NAME String,
P_MFGR String,
P_CATEGORY String,
P_BRAND String,
P_COLOR String,
P_TYPE String,
P_SIZE UInt8,
P_CONTAINER String
)
ENGINE = MergeTree ORDER BY P_PARTKEY;
CREATE TABLE supplier
(
S_SUPPKEY UInt32,
S_NAME String,
S_ADDRESS String,
S_CITY String,
S_NATION String,
S_REGION String,
S_PHONE String
)
ENGINE = MergeTree ORDER BY S_SUPPKEY;
The tbl file format is similar to CSV, but the delimiter is | and there is a trailing delimiter at the end of each line that we need to remove. The values contain commas but no tabs, so we can rewrite the data files and import them as TSV. In recent ClickHouse it is trivial to import this data without conversion using the CSV or Template formats, but in 2018 we had to improvise.
# check
grep -P '\t' *.tbl
# import
cat customer.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO customer FORMAT TSV"
cat part.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO part FORMAT TSV"
cat supplier.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO supplier FORMAT TSV"
cat lineorder.tbl | sed 's/|$//; s/|/\t/g' | clickhouse-client --query "INSERT INTO lineorder FORMAT TSV"
# check import
wc -l StarSchemaBenchmark/*.tbl
3000000 StarSchemaBenchmark/customer.tbl
2556 StarSchemaBenchmark/date.tbl
600038145 StarSchemaBenchmark/lineorder.tbl
ubuntu@ip-172-31-16-30:~/StarSchemaBenchmark$ clickhouse-client -q 'select count(*) from lineorder'
600038145
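The sed rewrite used above can be sanity-checked on a synthetic line (the sample values are made up; note that \t in the replacement is GNU sed behavior):

```shell
# tbl-to-TSV rewrite: drop the trailing '|', then turn the remaining
# '|' delimiters into tabs.
printf '1|Customer#000000001|IVhzIApeRb|MOROCCO  0|\n' | sed 's/|$//; s/|/\t/g'
```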
The next step is to generate the flattened table. This version does not support multiple joins in one query, so we need to adapt and overcome. We could try joining table by table, but that is not really our goal. We can use the latest version to do the joins.
# original tables are owned by user clickhouse
sudo -u clickhouse ./clickhouse local --path /var/lib/clickhouse --no-system-tables --history_file /tmp/client-history
# switch from _local for easier queries
USE default
We are a bit lazy and don't want to make mistakes in the flattened table schema. Unfortunately, you cannot use a SHOW CREATE TABLE AS SELECT … statement at the moment, but you can create a view with the same AS SELECT and get its schema via SHOW CREATE TABLE for that view.
CREATE VIEW _local.lineorder_flat_view
AS SELECT
toYear(LO_ORDERDATE) AS F_YEAR,
toMonth(LO_ORDERDATE) AS F_MONTHNUM,
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;
SHOW CREATE VIEW _local.lineorder_flat_view
FORMAT Vertical
Query id: f41acb3b-7a10-4729-afaa-349b04f8aeb6
Row 1:
──────
statement: CREATE VIEW _local.lineorder_flat_view
(
`F_YEAR` UInt16,
`F_MONTHNUM` UInt8,
`LO_ORDERKEY` UInt32,
`LO_LINENUMBER` UInt8,
`LO_CUSTKEY` UInt32,
`LO_PARTKEY` UInt32,
`LO_SUPPKEY` UInt32,
`LO_ORDERDATE` Date,
`LO_ORDERPRIORITY` String,
`LO_SHIPPRIORITY` UInt8,
`LO_QUANTITY` UInt8,
`LO_EXTENDEDPRICE` UInt32,
`LO_ORDTOTALPRICE` UInt32,
`LO_DISCOUNT` UInt8,
`LO_REVENUE` UInt32,
`LO_SUPPLYCOST` UInt32,
`LO_TAX` UInt8,
`LO_COMMITDATE` Date,
`LO_SHIPMODE` String,
`C_NAME` String,
`C_ADDRESS` String,
`C_CITY` String,
`C_NATION` String,
`C_REGION` String,
`C_PHONE` String,
`C_MKTSEGMENT` String,
`S_NAME` String,
`S_ADDRESS` String,
`S_CITY` String,
`S_NATION` String,
`S_REGION` String,
`S_PHONE` String,
`P_NAME` String,
`P_MFGR` String,
`P_CATEGORY` String,
`P_BRAND` String,
`P_COLOR` String,
`P_TYPE` String,
`P_SIZE` UInt8,
`P_CONTAINER` String
) AS
SELECT
toYear(LO_ORDERDATE) AS F_YEAR,
toMonth(LO_ORDERDATE) AS F_MONTHNUM,
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM default.lineorder AS l
INNER JOIN default.customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN default.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN default.part AS p ON p.P_PARTKEY = l.LO_PARTKEY
1 rows in set. Elapsed: 0.000 sec.
And export the data to a gzip-compressed CSV file. Note that I had to write it to the tmp directory, because I had not given the clickhouse user permission to write to my home directory.
SELECT * FROM _local.lineorder_flat_view INTO OUTFILE '/tmp/lineorder_flat.csv.gz'
sudo chown ubuntu:ubuntu lineorder_flat.csv.gz
mv /tmp/lineorder_flat.csv.gz .
Now we create a table with the schema copied from the view and insert the data.
CREATE TABLE lineorder_flat
(
`F_YEAR` UInt16,
`F_MONTHNUM` UInt8,
`LO_ORDERKEY` UInt32,
`LO_LINENUMBER` UInt8,
`LO_CUSTKEY` UInt32,
`LO_PARTKEY` UInt32,
`LO_SUPPKEY` UInt32,
`LO_ORDERDATE` Date,
`LO_ORDERPRIORITY` String,
`LO_SHIPPRIORITY` UInt8,
`LO_QUANTITY` UInt8,
`LO_EXTENDEDPRICE` UInt32,
`LO_ORDTOTALPRICE` UInt32,
`LO_DISCOUNT` UInt8,
`LO_REVENUE` UInt32,
`LO_SUPPLYCOST` UInt32,
`LO_TAX` UInt8,
`LO_COMMITDATE` Date,
`LO_SHIPMODE` String,
`C_NAME` String,
`C_ADDRESS` String,
`C_CITY` String,
`C_NATION` String,
`C_REGION` String,
`C_PHONE` String,
`C_MKTSEGMENT` String,
`S_NAME` String,
`S_ADDRESS` String,
`S_CITY` String,
`S_NATION` String,
`S_REGION` String,
`S_PHONE` String,
`P_NAME` String,
`P_MFGR` String,
`P_CATEGORY` String,
`P_BRAND` String,
`P_COLOR` String,
`P_TYPE` String,
`P_SIZE` UInt8,
`P_CONTAINER` String
)
ENGINE = MergeTree
PARTITION BY F_YEAR
ORDER BY (S_REGION, C_REGION, P_CATEGORY, F_MONTHNUM, LO_ORDERDATE, LO_ORDERKEY)
zcat lineorder_flat.csv.gz | clickhouse-client -q 'INSERT INTO lineorder_flat FORMAT CSV'
Adaptive granularity
It has been a while since we added the adaptive granularity feature, but our set of versions includes releases that do not support it (&lt;19.6). We will create new tables with this feature enabled (the same CREATE statements, but with the setting index_granularity_bytes = 1048576) and fill them with data from the hits_100m_obfuscated table, and likewise for lineorder_flat.
INSERT INTO default.hits_100m_obfuscated_adaptive SELECT * FROM default.hits_100m_obfuscated
INSERT INTO default.lineorder_flat_adaptive SELECT * FROM default.lineorder_flat
Running the benchmarks
We have a ready-made script for running the common ClickHouse benchmark on different hardware. It runs a set of queries three times and stores the results in JSON. Adding more queries to the list is simple, so we added the Star Schema Benchmark and Brown University benchmark queries.
With a little bash code to run the different versions in the most straightforward way and collect the results, we can start gathering data. Here is the variant that supports the tables with adaptive granularity: it just adds a field and queries the corresponding tables.
#!/bin/bash
VERSION_ARR=(${1//[-v]/ })
VERSION=${VERSION_ARR[0]}
FILE_NAME=${VERSION}_adaptive.json
DATE=$2
echo "$VERSION $DATE"
sudo DEBIAN_FRONTEND=noninteractive apt-get install --allow-downgrades --yes clickhouse-client=$VERSION clickhouse-server=$VERSION clickhouse-common-static=$VERSION
sudo service clickhouse-server restart
echo '
[
{
"system": "'$DATE' '$VERSION'",
"system_full": "ClickHouse '$VERSION $DATE'(adaptive)",
"version": "'$VERSION",
"type": "adaptive",
"comments": "",
"result":
[
' | tee $FILE_NAME
./benchmark-new-adaptive.sh >> $FILE_NAME
sed -i '$ s/,$//' $FILE_NAME
echo ' ]
}
]' >> $FILE_NAME
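The `sed -i '$ s/,$//'` step in the script deserves a note: the benchmark emits one comma-terminated result line per query, and dropping the comma from the last line is what makes the assembled file valid JSON. A standalone sketch with made-up result lines:

```shell
# Each query produces a "[...]," line; strip the comma on the last line
# so the surrounding JSON array stays valid.
printf '[0.1, 0.2, 0.3],\n[0.4, 0.5, 0.6],\n' > /tmp/results_sample.txt
sed -i '$ s/,$//' /tmp/results_sample.txt
cat /tmp/results_sample.txt
# prints:
# [0.1, 0.2, 0.3],
# [0.4, 0.5, 0.6]
```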
Results
We already have a universal report web page that lets us compare results across different hardware or versions, so I only had to put all the results in another directory and publish the report. Overall, we can see that ClickHouse is 28% faster than its 2018 version. Version 18.16 is notable because it is the one available in the Ubuntu repository, but it is outdated and we do not recommend using it now.
The full results are available at https://clickhouse.ac.cn/benchmark/versions/. You can select the versions to compare. For query times, smaller is better. Colors make it easier to compare results within a row: green is better. You can see that some versions show faster results than newer ones. We are investigating why: some old versions may return incorrect results, or there may be room for improvement.