
Writing Queries in ClickHouse Using GitHub Data

This dataset contains all commits and changes for the ClickHouse repository. It can be generated using the native git-import tool distributed with ClickHouse.

The generated data provides a tsv file for each of the following tables:

  • commits - commits with statistics.
  • file_changes - files changed in every commit, along with information about the change and statistics.
  • line_changes - every changed line in every changed file for every commit, with full information about the line and information about the previous change of this line.

As of November 8th, 2022, each TSV is approximately the following size and number of rows:

  • commits - 7.8M - 266,051 rows
  • file_changes - 53M - 266,051 rows
  • line_changes - 2.7G - 7,535,157 rows


Generating the data

This is optional. We distribute the data freely - see Downloading and inserting the data.

git clone [email protected]:ClickHouse/ClickHouse.git
cd ClickHouse
clickhouse git-import --skip-paths 'generated\.cpp|^(contrib|docs?|website|libs/(libcityhash|liblz4|libdivide|libvectorclass|libdouble-conversion|libcpuid|libzstd|libfarmhash|libmetrohash|libpoco|libwidechar_width))/' --skip-commits-with-messages '^Merge branch '

For the ClickHouse repository, this takes around 3 minutes (as of November 8th, 2022, on a MacBook Pro 2021) to complete.

A full list of available options can be obtained from the tool's native help:

clickhouse git-import -h

This help also provides the DDL for each of the above tables, e.g.:

CREATE TABLE git.commits
(
hash String,
author LowCardinality(String),
time DateTime,
message String,
files_added UInt32,
files_deleted UInt32,
files_renamed UInt32,
files_modified UInt32,
lines_added UInt32,
lines_deleted UInt32,
hunks_added UInt32,
hunks_removed UInt32,
hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

**These queries should work on any repository. Feel free to explore and report your findings!** Some guidelines with respect to execution times (as of November 2022):

  • Linux - ~/clickhouse git-import - 160 mins

Downloading and inserting the data

The following data can be used to reproduce a working environment. Alternatively, this dataset is available on play.clickhouse.com - see Queries for further details.

Generated files for the following repositories can be found below:

To insert this data, prepare the database by executing the following queries:

DROP DATABASE IF EXISTS git;
CREATE DATABASE git;

CREATE TABLE git.commits
(
hash String,
author LowCardinality(String),
time DateTime,
message String,
files_added UInt32,
files_deleted UInt32,
files_renamed UInt32,
files_modified UInt32,
lines_added UInt32,
lines_deleted UInt32,
hunks_added UInt32,
hunks_removed UInt32,
hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

CREATE TABLE git.file_changes
(
change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
path LowCardinality(String),
old_path LowCardinality(String),
file_extension LowCardinality(String),
lines_added UInt32,
lines_deleted UInt32,
hunks_added UInt32,
hunks_removed UInt32,
hunks_changed UInt32,

commit_hash String,
author LowCardinality(String),
time DateTime,
commit_message String,
commit_files_added UInt32,
commit_files_deleted UInt32,
commit_files_renamed UInt32,
commit_files_modified UInt32,
commit_lines_added UInt32,
commit_lines_deleted UInt32,
commit_hunks_added UInt32,
commit_hunks_removed UInt32,
commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

CREATE TABLE git.line_changes
(
sign Int8,
line_number_old UInt32,
line_number_new UInt32,
hunk_num UInt32,
hunk_start_line_number_old UInt32,
hunk_start_line_number_new UInt32,
hunk_lines_added UInt32,
hunk_lines_deleted UInt32,
hunk_context LowCardinality(String),
line LowCardinality(String),
indent UInt8,
line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),

prev_commit_hash String,
prev_author LowCardinality(String),
prev_time DateTime,

file_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
path LowCardinality(String),
old_path LowCardinality(String),
file_extension LowCardinality(String),
file_lines_added UInt32,
file_lines_deleted UInt32,
file_hunks_added UInt32,
file_hunks_removed UInt32,
file_hunks_changed UInt32,

commit_hash String,
author LowCardinality(String),
time DateTime,
commit_message String,
commit_files_added UInt32,
commit_files_deleted UInt32,
commit_files_renamed UInt32,
commit_files_modified UInt32,
commit_lines_added UInt32,
commit_lines_deleted UInt32,
commit_hunks_added UInt32,
commit_hunks_removed UInt32,
commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

Insert the data using INSERT INTO SELECT and the s3 function. For example, below we insert the ClickHouse files into each of their respective tables:

commits

INSERT INTO git.commits SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/commits.tsv.xz', 'TSV', 'hash String,author LowCardinality(String), time DateTime, message String, files_added UInt32, files_deleted UInt32, files_renamed UInt32, files_modified UInt32, lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32')

0 rows in set. Elapsed: 1.826 sec. Processed 62.78 thousand rows, 8.50 MB (34.39 thousand rows/s., 4.66 MB/s.)

file_changes

INSERT INTO git.file_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/file_changes.tsv.xz', 'TSV', 'change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6), path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32, commit_hash String, author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32')

0 rows in set. Elapsed: 2.688 sec. Processed 266.05 thousand rows, 48.30 MB (98.97 thousand rows/s., 17.97 MB/s.)

line_changes

INSERT INTO git.line_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/line_changes.tsv.xz', 'TSV', ' sign Int8, line_number_old UInt32, line_number_new UInt32, hunk_num UInt32, hunk_start_line_number_old UInt32, hunk_start_line_number_new UInt32, hunk_lines_added UInt32,\n hunk_lines_deleted UInt32, hunk_context LowCardinality(String), line LowCardinality(String), indent UInt8, line_type Enum(\'Empty\' = 0, \'Comment\' = 1, \'Punct\' = 2, \'Code\' = 3), prev_commit_hash String, prev_author LowCardinality(String), prev_time DateTime, file_change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6),\n path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), file_lines_added UInt32, file_lines_deleted UInt32, file_hunks_added UInt32, file_hunks_removed UInt32, file_hunks_changed UInt32, commit_hash String,\n author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32')

0 rows in set. Elapsed: 50.535 sec. Processed 7.54 million rows, 2.09 GB (149.11 thousand rows/s., 41.40 MB/s.)

Queries

The tool suggests several queries via its help output. We have answered these, along with some additional supplementary questions of interest. These queries are of approximately increasing complexity vs. the tool's arbitrary order.

This dataset is available in the git_clickhouse database on play.clickhouse.com. We provide a link to this environment for all queries, adapting the database name as required. Note that the play results may differ from those presented here due to differences in the time of data collection.

History of a single file

One of the simplest queries. Here we look at all commit messages for StorageReplicatedMergeTree.cpp. Since these are likely more interesting, we sort by the most recent messages first.

Play

SELECT
time,
substring(commit_hash, 1, 11) AS commit,
change_type,
author,
path,
old_path,
lines_added,
lines_deleted,
commit_message
FROM git.file_changes
WHERE path = 'src/Storages/StorageReplicatedMergeTree.cpp'
ORDER BY time DESC
LIMIT 10

┌────────────────time─┬─commit──────┬─change_type─┬─author─────────────┬─path────────────────────────────────────────┬─old_path─┬─lines_added─┬─lines_deleted─┬─commit_message───────────────────────────────────┐
2022-10-30 16:30:51 │ c68ab231f91 │ Modify │ Alexander Tokmakov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 1310 │ fix accessing part in Deleting state │
2022-10-23 16:24:20 │ b40d9200d20 │ Modify │ Anton Popov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 2830 │ better semantic of constsness of DataPartStorage │
2022-10-23 01:23:1556e5daba0c9 │ Modify │ Anton Popov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 2844 │ remove DataPartStorageBuilder │
2022-10-21 13:35:37851f556d65a │ Modify │ Igor Nikonov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 32 │ Remove unused parameter │
2022-10-21 13:02:5213d31eefbc3 │ Modify │ Igor Nikonov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 44 │ Replicated merge tree polishing │
2022-10-21 12:25:194e76629aafc │ Modify │ Azat Khuzhin │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 32 │ Fixes for -Wshorten-64-to-32
2022-10-19 13:59:2805e6b94b541 │ Modify │ Antonio Andelic │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 40 │ Polishing │
2022-10-19 13:34:20 │ e5408aac991 │ Modify │ Antonio Andelic │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 353 │ Simplify logic │
2022-10-18 15:36:117befe2825c9 │ Modify │ Alexey Milovidov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 22Update StorageReplicatedMergeTree.cpp │
2022-10-18 15:35:440623ad4e374 │ Modify │ Alexey Milovidov │ src/Storages/StorageReplicatedMergeTree.cpp │ │ 11Update StorageReplicatedMergeTree.cpp │
└─────────────────────┴─────────────┴─────────────┴────────────────────┴─────────────────────────────────────────────┴──────────┴─────────────┴───────────────┴──────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.006 sec. Processed 12.10 thousand rows, 1.60 MB (1.93 million rows/s., 255.40 MB/s.)

We can also review the line changes, excluding renames; i.e., we won't show changes before a rename event, when the file existed under a different name:

Play

SELECT
time,
substring(commit_hash, 1, 11) AS commit,
sign,
line_number_old,
line_number_new,
author,
line
FROM git.line_changes
WHERE path = 'src/Storages/StorageReplicatedMergeTree.cpp'
ORDER BY line_number_new ASC
LIMIT 10

┌────────────────time─┬─commit──────┬─sign─┬─line_number_old─┬─line_number_new─┬─author───────────┬─line──────────────────────────────────────────────────┐
2020-04-16 02:06:10 │ cdeda4ab915 │ -111 │ Alexey Milovidov │ #include <Disks/DiskSpaceMonitor.h> │
2020-04-16 02:06:10 │ cdeda4ab915 │ 121 │ Alexey Milovidov │ #include <Core/Defines.h> │
2020-04-16 02:06:10 │ cdeda4ab915 │ 122 │ Alexey Milovidov │ │
2021-05-03 23:46:5102ce9cc7254 │ -132 │ Alexey Milovidov │ #include <Common/FieldVisitors.h> │
2021-05-27 22:21:02 │ e2f29b9df02 │ -132 │ s-kat │ #include <Common/FieldVisitors.h> │
2022-10-03 22:30:50210882b9c4d │ 123 │ alesapin │ #include <ranges> │
2022-10-23 16:24:20 │ b40d9200d20 │ 123 │ Anton Popov │ #include <cstddef> │
2021-06-20 09:24:434c391f8e994 │ 123 │ Mike Kot │ #include "Common/hex.h" │
2021-12-29 09:18:568112a712336 │ -165 │ avogar │ #include <Common/ThreadPool.h> │
2022-04-21 20:19:139133e398b8c │ 11112 │ Nikolai Kochetov │ #include <Storages/MergeTree/DataPartStorageOnDisk.h> │
└─────────────────────┴─────────────┴──────┴─────────────────┴─────────────────┴──────────────────┴───────────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.258 sec. Processed 7.54 million rows, 654.92 MB (29.24 million rows/s., 2.54 GB/s.)

Note that a more complex variant of this query exists, where we find the line-by-line commit history of a file while considering renames.

Find the currently active files

This is important for later analysis, when we only want to consider the current files in the repository. We estimate this set as the files which haven't been renamed or deleted (and then re-added/re-named).

Note that there appears to have been a broken commit history in relation to files under the dbms, libs and tests/testflows/ directories during their renames. We therefore also exclude these directories.

Play

SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)') ORDER BY path
LIMIT 10

┌─path────────────────────────────────────────────────────────────┐
│ tests/queries/0_stateless/01054_random_printable_ascii_ubsan.sh │
│ tests/queries/0_stateless/02247_read_bools_as_numbers_json.sh │
│ tests/performance/file_table_function.xml │
│ tests/queries/0_stateless/01902_self_aliases_in_columns.sql
│ tests/queries/0_stateless/01070_h3_get_base_cell.reference │
│ src/Functions/ztest.cpp │
│ src/Interpreters/InterpreterShowTablesQuery.h │
│ src/Parsers/Kusto/ParserKQLStatement.h │
│ tests/queries/0_stateless/00938_dataset_test.sql
│ src/Dictionaries/Embedded/GeodataProviders/Types.h │
└─────────────────────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.085 sec. Processed 532.10 thousand rows, 8.68 MB (6.30 million rows/s., 102.64 MB/s.)

Note that this allows files to be renamed and then re-renamed back to their original value. First we aggregate old_path for a list of files deleted as a result of renaming. We union this with the last operation for every path. Finally, we filter this list to those where the final event is not a Delete:

Play

SELECT uniq(path)
FROM
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)') ORDER BY path
)

┌─uniq(path)─┐
18559
└────────────┘
1 row in set. Elapsed: 0.089 sec. Processed 532.10 thousand rows, 8.68 MB (6.01 million rows/s., 97.99 MB/s.)

Note that we skipped several directories during the import, i.e.:

--skip-paths 'generated\.cpp|^(contrib|docs?|website|libs/(libcityhash|liblz4|libdivide|libvectorclass|libdouble-conversion|libcpuid|libzstd|libfarmhash|libmetrohash|libpoco|libwidechar_width))/'

Applying this pattern to git ls-files reports 18155:

git ls-files | grep -v -E 'generated\.cpp|^(contrib|docs?|website|libs/(libcityhash|liblz4|libdivide|libvectorclass|libdouble-conversion|libcpuid|libzstd|libfarmhash|libmetrohash|libpoco|libwidechar_width))/' | wc -l
18155

Our current solution is therefore an estimate of the current files.

The difference here is caused by a few factors:

  • A rename can occur alongside other modifications to a file. These are listed as separate events in file_changes, but with the same time. The argMax function has no way of distinguishing them - it picks the first value. The natural ordering of the inserts (the only means of knowing the correct order) is not maintained across the union, so Modify events can be selected. For example, below the src/Functions/geometryFromColumn.h file has several modifications before being renamed to src/Functions/geometryConverters.h. Our current solution may pick a Modify event as the latest change, causing src/Functions/geometryFromColumn.h to be retained.

Play

  SELECT
change_type,
path,
old_path,
time,
commit_hash
FROM git.file_changes
WHERE (path = 'src/Functions/geometryFromColumn.h') OR (old_path = 'src/Functions/geometryFromColumn.h')

┌─change_type─┬─path───────────────────────────────┬─old_path───────────────────────────┬────────────────time─┬─commit_hash──────────────────────────────┐
Add │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:169376b676e9a9bb8911b872e1887da85a45f7479d │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:166d59be5ea4768034f6526f7f9813062e0c369f7b │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:1633acc2aa5dc091a7cb948f78c558529789b2bad8 │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:1678e0db268ceadc42f82bc63a77ee1a4da6002463 │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:1614a891057d292a164c4179bfddaef45a74eaf83a │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:16 │ d0d6e6953c2a2af9fb2300921ff96b9362f22edb │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:16 │ fe8382521139a58c0ba277eb848e88894658db66 │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:163be3d5cde8788165bc0558f1e2a22568311c3103 │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:16 │ afad9bf4d0a55ed52a3f55483bc0973456e10a56 │
Modify │ src/Functions/geometryFromColumn.h │ │ 2021-03-11 12:08:16 │ e3290ecc78ca3ea82b49ebcda22b5d3a4df154e6 │
Rename │ src/Functions/geometryConverters.h │ src/Functions/geometryFromColumn.h │ 2021-03-11 12:08:16125945769586baf6ffd15919b29565b1b2a63218 │
└─────────────┴────────────────────────────────────┴────────────────────────────────────┴─────────────────────┴──────────────────────────────────────────┘
11 rows in set. Elapsed: 0.030 sec. Processed 266.05 thousand rows, 6.61 MB (8.89 million rows/s., 220.82 MB/s.)
  • Broken commit histories - missing delete events. Source and cause TBD.

These differences shouldn't meaningfully impact our analysis. **We welcome improved versions of this query.**
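The current-files estimation used throughout these queries can be sketched in plain Python on a handful of change events (the paths, times and event codes below are hypothetical, purely for illustration):

```python
from collections import defaultdict

DELETE = 2  # change_type code for Delete

# Hypothetical events: (time, path, old_path, change_type)
# change_type: 1=Add, 2=Delete, 3=Modify, 4=Rename
changes = [
    (1, "a.cpp", "",      1),  # a.cpp added
    (2, "b.cpp", "a.cpp", 4),  # a.cpp renamed to b.cpp
    (3, "b.cpp", "",      3),  # b.cpp modified
    (4, "c.cpp", "",      1),  # c.cpp added
    (5, "c.cpp", "",      2),  # c.cpp deleted
]

# A rename's old_path counts as a delete of that path (the UNION ALL branch).
events = defaultdict(list)
for time, path, old_path, ctype in changes:
    if old_path:
        events[old_path].append((time, DELETE))
    events[path].append((time, ctype))

# Keep only paths whose latest event is not a delete (the HAVING clause).
current = sorted(p for p, evs in events.items() if max(evs)[1] != DELETE)
print(current)  # a.cpp was renamed away and c.cpp deleted, so only b.cpp remains
```

Note the tie-break on `max(evs)` when two events share a time is arbitrary, which mirrors the argMax ambiguity described above.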

List files with the most modifications

Limiting to current files, we consider the number of modifications to be the sum of deletes and additions.

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
path,
sum(lines_added) + sum(lines_deleted) AS modifications
FROM git.file_changes
WHERE (path IN (current_files)) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY path
ORDER BY modifications DESC
LIMIT 10

┌─path───────────────────────────────────────────────────┬─modifications─┐
│ src/Storages/StorageReplicatedMergeTree.cpp │ 21871
│ src/Storages/MergeTree/MergeTreeData.cpp │ 17709
│ programs/client/Client.cpp │ 15882
│ src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp │ 14249
│ src/Interpreters/InterpreterSelectQuery.cpp │ 12636
│ src/Parsers/ExpressionListParsers.cpp │ 11794
│ src/Analyzer/QueryAnalysisPass.cpp │ 11760
│ src/Coordination/KeeperStorage.cpp │ 10225
│ src/Functions/FunctionsConversion.h │ 9247
│ src/Parsers/ExpressionElementParsers.cpp │ 8197
└────────────────────────────────────────────────────────┴───────────────┘

10 rows in set. Elapsed: 0.134 sec. Processed 798.15 thousand rows, 16.46 MB (5.95 million rows/s., 122.62 MB/s.)

What day of the week do commits usually occur?

Play

SELECT
day_of_week,
count() AS c
FROM git.commits
GROUP BY dayOfWeek(time) AS day_of_week

┌─day_of_week─┬─────c─┐
110575
210645
310748
410944
510090
64617
75166
└─────────────┴───────┘
7 rows in set. Elapsed: 0.262 sec. Processed 62.78 thousand rows, 251.14 KB (239.73 thousand rows/s., 958.93 KB/s.)

This makes sense, with some productivity drop-off on Fridays. Great to see people committing code at weekends! Big thanks to our contributors!

History of a subdirectory/file - lines, commits and contributors over time

This would produce a large query result that is unrealistic to show or visualize if unfiltered. We therefore allow a file or subdirectory to be filtered in the following example. Here we group by week using the toStartOfWeek function - adapt as required.

Play

SELECT
week,
sum(lines_added) AS lines_added,
sum(lines_deleted) AS lines_deleted,
uniq(commit_hash) AS num_commits,
uniq(author) AS authors
FROM git.file_changes
WHERE path LIKE 'src/Storages%'
GROUP BY toStartOfWeek(time) AS week
ORDER BY week ASC
LIMIT 10

┌───────week─┬─lines_added─┬─lines_deleted─┬─num_commits─┬─authors─┐
2020-03-29493543
2020-04-059406015514
2020-04-1214726073211
2020-04-199178413912
2020-04-2610676263610
2020-05-035144352710
2020-05-1025525374812
2020-05-1735851913839
2020-05-24285118127418
2020-05-31277120777716
└────────────┴─────────────┴───────────────┴─────────────┴─────────┘
10 rows in set. Elapsed: 0.043 sec. Processed 266.05 thousand rows, 15.85 MB (6.12 million rows/s., 364.61 MB/s.)

This data visualizes well. Below we use Superset.

Lines added and deleted:

Commits and authors:

List files with the maximum number of authors

Current files only.

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
path,
uniq(author) AS num_authors
FROM git.file_changes
WHERE path IN (current_files)
GROUP BY path
ORDER BY num_authors DESC
LIMIT 10

┌─path────────────────────────────────────────┬─num_authors─┐
│ src/Core/Settings.h │ 127
│ CMakeLists.txt │ 96
.gitmodules │ 85
│ src/Storages/MergeTree/MergeTreeData.cpp │ 72
│ src/CMakeLists.txt │ 71
│ programs/server/Server.cpp │ 70
│ src/Interpreters/Context.cpp │ 64
│ src/Storages/StorageReplicatedMergeTree.cpp │ 63
│ src/Common/ErrorCodes.cpp │ 61
│ src/Interpreters/InterpreterSelectQuery.cpp │ 59
└─────────────────────────────────────────────┴─────────────┘

10 rows in set. Elapsed: 0.239 sec. Processed 798.15 thousand rows, 14.13 MB (3.35 million rows/s., 59.22 MB/s.)

Oldest lines of code in the repository

Current files only.

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
any(path) AS file_path,
line,
max(time) AS latest_change,
any(file_change_type)
FROM git.line_changes
WHERE path IN (current_files)
GROUP BY line
ORDER BY latest_change ASC
LIMIT 10

┌─file_path───────────────────────────────────┬─line────────────────────────────────────────────────────────┬───────latest_change─┬─any(file_change_type)─┐
│ utils/compressor/test.sh │ ./compressor -d < compressor.snp > compressor2 │ 2011-06-17 22:19:39Modify
│ utils/compressor/test.sh │ ./compressor < compressor > compressor.snp │ 2011-06-17 22:19:39Modify
│ utils/compressor/test.sh │ ./compressor -d < compressor.qlz > compressor2 │ 2014-02-24 03:14:30Add
│ utils/compressor/test.sh │ ./compressor < compressor > compressor.qlz │ 2014-02-24 03:14:30Add
│ utils/config-processor/config-processor.cpp │ if (argc != 2)2014-02-26 19:10:00Add
│ utils/config-processor/config-processor.cpp │ std::cerr << "std::exception: " << e.what() << std::endl;2014-02-26 19:10:00Add
│ utils/config-processor/config-processor.cpp │ std::cerr << "Exception: " << e.displayText() << std::endl;2014-02-26 19:10:00Add
│ utils/config-processor/config-processor.cpp │ Poco::XML::DOMWriter().writeNode(std::cout, document);2014-02-26 19:10:00Add
│ utils/config-processor/config-processor.cpp │ std::cerr << "Some exception" << std::endl;2014-02-26 19:10:00Add
│ utils/config-processor/config-processor.cpp │ std::cerr << "usage: " << argv[0] << " path" << std::endl;2014-02-26 19:10:00Add
└─────────────────────────────────────────────┴─────────────────────────────────────────────────────────────┴─────────────────────┴───────────────────────┘

10 rows in set. Elapsed: 1.101 sec. Processed 8.07 million rows, 905.86 MB (7.33 million rows/s., 823.13 MB/s.)

Files with the longest history

Current files only.

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
count() AS c,
path,
max(time) AS latest_change
FROM git.file_changes
WHERE path IN (current_files)
GROUP BY path
ORDER BY c DESC
LIMIT 10

┌───c─┬─path────────────────────────────────────────┬───────latest_change─┐
790 │ src/Storages/StorageReplicatedMergeTree.cpp │ 2022-10-30 16:30:51
788 │ src/Storages/MergeTree/MergeTreeData.cpp │ 2022-11-04 09:26:44
752 │ src/Core/Settings.h │ 2022-10-25 11:35:25
749 │ CMakeLists.txt │ 2022-10-05 21:00:49
575 │ src/Interpreters/InterpreterSelectQuery.cpp │ 2022-11-01 10:20:10
563 │ CHANGELOG.md │ 2022-10-27 08:19:50
491 │ src/Interpreters/Context.cpp │ 2022-10-25 12:26:29
437 │ programs/server/Server.cpp │ 2022-10-21 12:25:19
375 │ programs/client/Client.cpp │ 2022-11-03 03:16:55
350 │ src/CMakeLists.txt │ 2022-10-24 09:22:37
└─────┴─────────────────────────────────────────────┴─────────────────────┘

10 rows in set. Elapsed: 0.124 sec. Processed 798.15 thousand rows, 14.71 MB (6.44 million rows/s., 118.61 MB/s.)

Our core data structure, the Merge Tree, is obviously under constant evolution, with a long history of edits!

Distribution of contributors with respect to docs and code over the month

During data capture, changes in the docs/ folder were filtered out due to a very messy commit history. The results of this query are therefore not accurate.

Do we write more docs at certain times of the month, e.g., around release dates? We can use the countIf function to compute a simple ratio, visualizing the result with the bar function.

Play

SELECT
day,
bar(docs_ratio * 1000, 0, 100, 100) AS bar
FROM
(
SELECT
day,
countIf(file_extension IN ('h', 'cpp', 'sql')) AS code,
countIf(file_extension = 'md') AS docs,
docs / (code + docs) AS docs_ratio
FROM git.line_changes
WHERE (sign = 1) AND (file_extension IN ('h', 'cpp', 'sql', 'md'))
GROUP BY dayOfMonth(time) AS day
)

┌─day─┬─bar─────────────────────────────────────────────────────────────┐
1 │ ███████████████████████████████████▍ │
2 │ ███████████████████████▋ │
3 │ ████████████████████████████████▋ │
4 │ █████████████ │
5 │ █████████████████████▎ │
6 │ ████████ │
7 │ ███▋ │
8 │ ████████▌ │
9 │ ██████████████▎ │
10 │ █████████████████▏ │
11 │ █████████████▎ │
12 │ ███████████████████████████████████▋ │
13 │ █████████████████████████████▎ │
14 │ ██████▋ │
15 │ █████████████████████████████████████████▊ │
16 │ ██████████▎ │
17 │ ██████████████████████████████████████▋ │
18 │ █████████████████████████████████▌ │
19 │ ███████████ │
20 │ █████████████████████████████████▊ │
21 │ █████ │
22 │ ███████████████████████▋ │
23 │ ███████████████████████████▌ │
24 │ ███████▌ │
25 │ ██████████████████████████████████▎ │
26 │ ███████████▏ │
27 │ ███████████████████████████████████████████████████████████████ │
28 │ ████████████████████████████████████████████████████▏ │
29 │ ███▌ │
30 │ ████████████████████████████████████████▎ │
31 │ █████████████████████████████████▏ │
└─────┴─────────────────────────────────────────────────────────────────┘

31 rows in set. Elapsed: 0.043 sec. Processed 7.54 million rows, 40.53 MB (176.71 million rows/s., 950.40 MB/s.)
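The bar function used above maps a value within a range onto a fixed-width string of block characters. A rough Python analogue of the idea (an illustration only, not ClickHouse's implementation, which also renders fractional eighth-blocks):

```python
def bar(x, min_v, max_v, width=100):
    """Render x on a min_v..max_v scale as up to `width` block characters."""
    frac = (x - min_v) / (max_v - min_v)
    frac = max(0.0, min(1.0, frac))  # clamp out-of-range values
    return "█" * round(frac * width)

print(bar(35.4, 0, 100, 100))  # 35 full blocks
```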

Maybe slightly more towards the end of the month, but overall we keep a good even distribution. Again, this is unreliable due to the filtering of the docs folder during data insertion.

Authors with the most diverse impact

We consider diversity here to be the number of unique files an author has contributed to.

Play

SELECT
author,
uniq(path) AS num_files
FROM git.file_changes
WHERE (change_type IN ('Add', 'Modify')) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY author
ORDER BY num_files DESC
LIMIT 10

┌─author─────────────┬─num_files─┐
│ Alexey Milovidov │ 8433
│ Nikolai Kochetov │ 3257
│ Vitaly Baranov │ 2316
│ Maksim Kita │ 2172
│ Azat Khuzhin │ 1988
│ alesapin │ 1818
│ Alexander Tokmakov │ 1751
│ Amos Bird │ 1641
│ Ivan │ 1629
│ alexey-milovidov │ 1581
└────────────────────┴───────────┘

10 rows in set. Elapsed: 0.041 sec. Processed 266.05 thousand rows, 4.92 MB (6.56 million rows/s., 121.21 MB/s.)

Let's see who has the most diverse commits in their recent work. Rather than limiting by date, we restrict to an author's last N commits (here we use 3, but feel free to modify):

Play

SELECT
author,
sum(num_files_commit) AS num_files
FROM
(
SELECT
author,
commit_hash,
uniq(path) AS num_files_commit,
max(time) AS commit_time
FROM git.file_changes
WHERE (change_type IN ('Add', 'Modify')) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY
author,
commit_hash
ORDER BY
author ASC,
commit_time DESC
LIMIT 3 BY author
)
GROUP BY author
ORDER BY num_files DESC
LIMIT 10

┌─author───────────────┬─num_files─┐
│ Mikhail │ 782
│ Li Yin │ 553
│ Roman Peshkurov │ 119
│ Vladimir Smirnov │ 88
│ f1yegor │ 65
│ maiha │ 54
│ Vitaliy Lyudvichenko │ 53
│ Pradeep Chhetri │ 40
│ Orivej Desh │ 38
│ liyang │ 36
└──────────────────────┴───────────┘

10 rows in set. Elapsed: 0.106 sec. Processed 266.05 thousand rows, 21.04 MB (2.52 million rows/s., 198.93 MB/s.)
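The LIMIT 3 BY author clause in the inner query keeps at most the first three rows for each author value of the ordered result. In Python terms (a toy illustration with hypothetical rows):

```python
from itertools import groupby, islice

# Rows already sorted by (author ASC, commit_time DESC), mirroring the
# ORDER BY that precedes the LIMIT 3 BY author clause.
rows = [
    ("alice", "c5"), ("alice", "c4"), ("alice", "c3"), ("alice", "c2"),
    ("bob", "c9"), ("bob", "c8"),
]

# Keep at most 3 rows per author, preserving order.
limited = [
    row
    for _, group in groupby(rows, key=lambda r: r[0])
    for row in islice(group, 3)
]
print(limited)  # alice keeps c5, c4, c3; bob keeps both of his rows
```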

Favorite files for an author

Here we select our founder Alexey Milovidov and limit our analysis to current files.

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
path,
count() AS c
FROM git.file_changes
WHERE (author = 'Alexey Milovidov') AND (path IN (current_files))
GROUP BY path
ORDER BY c DESC
LIMIT 10

┌─path────────────────────────────────────────┬───c─┐
│ CMakeLists.txt │ 165
│ CHANGELOG.md │ 126
│ programs/server/Server.cpp │ 73
│ src/Storages/MergeTree/MergeTreeData.cpp │ 71
│ src/Storages/StorageReplicatedMergeTree.cpp │ 68
│ src/Core/Settings.h │ 65
│ programs/client/Client.cpp │ 57
│ programs/server/play.html │ 48
.gitmodules │ 47
│ programs/install/Install.cpp │ 37
└─────────────────────────────────────────────┴─────┘

10 rows in set. Elapsed: 0.106 sec. Processed 798.15 thousand rows, 13.97 MB (7.51 million rows/s., 131.41 MB/s.)

This makes sense, because Alexey has been responsible for maintaining the Changelog. But what if we use the base name of a file to identify his popular files? This allows for renames and should focus on code contributions:

Play

SELECT
base,
count() AS c
FROM git.file_changes
WHERE (author = 'Alexey Milovidov') AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY basename(path) AS base
ORDER BY c DESC
LIMIT 10

┌─base───────────────────────────┬───c─┐
│ StorageReplicatedMergeTree.cpp │ 393
│ InterpreterSelectQuery.cpp │ 299
│ Aggregator.cpp │ 297
│ Client.cpp │ 280
│ MergeTreeData.cpp │ 274
│ Server.cpp │ 264
│ ExpressionAnalyzer.cpp │ 259
│ StorageMergeTree.cpp │ 239
│ Settings.h │ 225
│ TCPHandler.cpp │ 205
└────────────────────────────────┴─────┘
10 rows in set. Elapsed: 0.032 sec. Processed 266.05 thousand rows, 5.68 MB (8.22 million rows/s., 175.50 MB/s.)

This is perhaps more reflective of his areas of interest.

Largest files with the lowest number of authors

For this we first need to identify the largest files. Estimating this via a full file reconstruction for every file from the commit history would be very expensive!

To estimate, assuming we restrict to current files, we sum line additions and subtract deletions. We can then compute a ratio of length to the number of authors.

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
path,
sum(lines_added) - sum(lines_deleted) AS num_lines,
uniqExact(author) AS num_authors,
num_lines / num_authors AS lines_author_ratio
FROM git.file_changes
WHERE path IN (current_files)
GROUP BY path
ORDER BY lines_author_ratio DESC
LIMIT 10

┌─path──────────────────────────────────────────────────────────────────┬─num_lines─┬─num_authors─┬─lines_author_ratio─┐
│ src/Common/ClassificationDictionaries/emotional_dictionary_rus.txt │ 1485901148590
│ src/Functions/ClassificationDictionaries/emotional_dictionary_rus.txt │ 55533155533
│ src/Functions/ClassificationDictionaries/charset_freq.txt │ 35722135722
│ src/Common/ClassificationDictionaries/charset_freq.txt │ 35722135722
│ tests/integration/test_storage_meilisearch/movies.json │ 19549119549
│ tests/queries/0_stateless/02364_multiSearch_function_family.reference │ 12874112874
│ src/Functions/ClassificationDictionaries/programming_freq.txt │ 943419434
│ src/Common/ClassificationDictionaries/programming_freq.txt │ 943419434
│ tests/performance/explain_ast.xml │ 591115911
│ src/Analyzer/QueryAnalysisPass.cpp │ 568615686
└───────────────────────────────────────────────────────────────────────┴───────────┴─────────────┴────────────────────┘

10 rows in set. Elapsed: 0.138 sec. Processed 798.15 thousand rows, 16.57 MB (5.79 million rows/s., 120.11 MB/s.)

Text dictionaries probably aren't realistic, so let's restrict to code only via a file-extension filter!

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
path,
sum(lines_added) - sum(lines_deleted) AS num_lines,
uniqExact(author) AS num_authors,
num_lines / num_authors AS lines_author_ratio
FROM git.file_changes
WHERE (path IN (current_files)) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY path
ORDER BY lines_author_ratio DESC
LIMIT 10

┌─path──────────────────────────────────┬─num_lines─┬─num_authors─┬─lines_author_ratio─┐
│ src/Analyzer/QueryAnalysisPass.cpp │ 568615686
│ src/Analyzer/QueryTreeBuilder.cpp │ 8801880
│ src/Planner/Planner.cpp │ 8731873
│ src/Backups/RestorerFromBackup.cpp │ 8691869
│ utils/memcpy-bench/FastMemcpy.h │ 7701770
│ src/Planner/PlannerActionsVisitor.cpp │ 7651765
│ src/Functions/sphinxstemen.cpp │ 7281728
│ src/Planner/PlannerJoinTree.cpp │ 7081708
│ src/Planner/PlannerJoins.cpp │ 6951695
│ src/Analyzer/QueryNode.h │ 6071607
└───────────────────────────────────────┴───────────┴─────────────┴────────────────────┘
10 rows in set. Elapsed: 0.140 sec. Processed 798.15 thousand rows, 16.84 MB (5.70 million rows/s., 120.32 MB/s.)

This has some recency bias - newer files have had fewer opportunities for commits. What about if we restrict to files at least 1 year old?

Play

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
min(time) AS min_date,
path,
sum(lines_added) - sum(lines_deleted) AS num_lines,
uniqExact(author) AS num_authors,
num_lines / num_authors AS lines_author_ratio
FROM git.file_changes
WHERE (path IN (current_files)) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY path
HAVING min_date <= (now() - toIntervalYear(1))
ORDER BY lines_author_ratio DESC
LIMIT 10

┌────────────min_date─┬─path───────────────────────────────────────────────────────────┬─num_lines─┬─num_authors─┬─lines_author_ratio─┐
│ 2021-03-08 07:00:54 │ utils/memcpy-bench/FastMemcpy.h                                │       770 │           1 │                770 │
│ 2021-05-04 13:47:34 │ src/Functions/sphinxstemen.cpp                                 │       728 │           1 │                728 │
│ 2021-03-14 16:52:51 │ utils/memcpy-bench/glibc/dwarf2.h                              │       592 │           1 │                592 │
│ 2021-03-08 09:04:52 │ utils/memcpy-bench/FastMemcpy_Avx.h                            │       496 │           1 │                496 │
│ 2020-10-19 01:10:50 │ tests/queries/0_stateless/01518_nullable_aggregate_states2.sql │       411 │           1 │                411 │
│ 2020-11-24 14:53:34 │ programs/server/GRPCHandler.cpp                                │       399 │           1 │                399 │
│ 2021-03-09 14:10:28 │ src/DataTypes/Serializations/SerializationSparse.cpp           │       363 │           1 │                363 │
│ 2021-08-20 15:06:57 │ src/Functions/vectorFunctions.cpp                              │      1327 │           4 │             331.75 │
│ 2020-08-04 03:26:23 │ src/Interpreters/MySQL/CreateQueryConvertVisitor.cpp           │       311 │           1 │                311 │
│ 2020-11-06 15:45:13 │ src/Storages/Rocksdb/StorageEmbeddedRocksdb.cpp                │       611 │           2 │              305.5 │
└─────────────────────┴────────────────────────────────────────────────────────────────┴───────────┴─────────────┴────────────────────┘

10 rows in set. Elapsed: 0.143 sec. Processed 798.15 thousand rows, 18.00 MB (5.58 million rows/s., 125.87 MB/s.)

Distribution of commits and lines of code by time, by day of week, by author, and for specific subdirectories

We interpret this as the number of lines added and removed by day of the week. In this case, we focus on the Functions directory.

SELECT
dayOfWeek,
uniq(commit_hash) AS commits,
sum(lines_added) AS lines_added,
sum(lines_deleted) AS lines_deleted
FROM git.file_changes
WHERE path LIKE 'src/Functions%'
GROUP BY toDayOfWeek(time) AS dayOfWeek

┌─dayOfWeek─┬─commits─┬─lines_added─┬─lines_deleted─┐
│         1 │     476 │       24619 │         15782 │
│         2 │     434 │       18098 │          9938 │
│         3 │     496 │       26562 │         20883 │
│         4 │     587 │       65674 │         18862 │
│         5 │     504 │       85917 │         14518 │
│         6 │     314 │       13604 │         10144 │
│         7 │     294 │       11938 │          6451 │
└───────────┴─────────┴─────────────┴───────────────┘

7 rows in set. Elapsed: 0.034 sec. Processed 266.05 thousand rows, 14.66 MB (7.73 million rows/s., 425.56 MB/s.)

And by time of day:

SELECT
hourOfDay,
uniq(commit_hash) AS commits,
sum(lines_added) AS lines_added,
sum(lines_deleted) AS lines_deleted
FROM git.file_changes
WHERE path LIKE 'src/Functions%'
GROUP BY toHour(time) AS hourOfDay

┌─hourOfDay─┬─commits─┬─lines_added─┬─lines_deleted─┐
│         0 │      71 │        4169 │          3404 │
│         1 │      90 │        2174 │          1927 │
│         2 │      65 │        2343 │          1515 │
│         3 │      76 │        2552 │           493 │
│         4 │      62 │        1480 │          1304 │
│         5 │      38 │        1644 │           253 │
│         6 │     104 │        4434 │          2979 │
│         7 │     117 │        4171 │          1678 │
│         8 │     106 │        4604 │          4673 │
│         9 │     135 │       60550 │          2678 │
│        10 │     149 │        6133 │          3482 │
│        11 │     182 │        8040 │          3833 │
│        12 │     209 │       29428 │         15040 │
│        13 │     187 │       10204 │          5491 │
│        14 │     204 │        9028 │          6060 │
│        15 │     231 │       15179 │         10077 │
│        16 │     196 │        9568 │          5925 │
│        17 │     138 │        4941 │          3849 │
│        18 │     123 │        4193 │          3036 │
│        19 │     165 │        8817 │          6646 │
│        20 │     140 │        3749 │          2379 │
│        21 │     132 │       41585 │          4182 │
│        22 │      85 │        4094 │          3955 │
│        23 │     100 │        3332 │          1719 │
└───────────┴─────────┴─────────────┴───────────────┘

24 rows in set. Elapsed: 0.039 sec. Processed 266.05 thousand rows, 14.66 MB (6.77 million rows/s., 372.89 MB/s.)

This distribution makes sense given that most of our development team is in Amsterdam. The bar function helps us visualize these distributions:

SELECT
hourOfDay,
bar(commits, 0, 400, 50) AS commits,
bar(lines_added, 0, 30000, 50) AS lines_added,
bar(lines_deleted, 0, 15000, 50) AS lines_deleted
FROM
(
SELECT
hourOfDay,
uniq(commit_hash) AS commits,
sum(lines_added) AS lines_added,
sum(lines_deleted) AS lines_deleted
FROM git.file_changes
WHERE path LIKE 'src/Functions%'
GROUP BY toHour(time) AS hourOfDay
)

┌─hourOfDay─┬─commits───────────────────────┬─lines_added────────────────────────────────────────┬─lines_deleted──────────────────────────────────────┐
0 │ ████████▊ │ ██████▊ │ ███████████▎ │
1 │ ███████████▎ │ ███▌ │ ██████▍ │
2 │ ████████ │ ███▊ │ █████ │
3 │ █████████▌ │ ████▎ │ █▋ │
4 │ ███████▋ │ ██▍ │ ████▎ │
5 │ ████▋ │ ██▋ │ ▋ │
6 │ █████████████ │ ███████▍ │ █████████▊ │
7 │ ██████████████▋ │ ██████▊ │ █████▌ │
8 │ █████████████▎ │ ███████▋ │ ███████████████▌ │
9 │ ████████████████▊ │ ██████████████████████████████████████████████████ │ ████████▊ │
10 │ ██████████████████▋ │ ██████████▏ │ ███████████▌ │
11 │ ██████████████████████▋ │ █████████████▍ │ ████████████▋ │
12 │ ██████████████████████████ │ █████████████████████████████████████████████████ │ ██████████████████████████████████████████████████ │
13 │ ███████████████████████▍ │ █████████████████ │ ██████████████████▎ │
14 │ █████████████████████████▌ │ ███████████████ │ ████████████████████▏ │
15 │ ████████████████████████████▊ │ █████████████████████████▎ │ █████████████████████████████████▌ │
16 │ ████████████████████████▌ │ ███████████████▊ │ ███████████████████▋ │
17 │ █████████████████▎ │ ████████▏ │ ████████████▋ │
18 │ ███████████████▍ │ ██████▊ │ ██████████ │
19 │ ████████████████████▋ │ ██████████████▋ │ ██████████████████████▏ │
20 │ █████████████████▌ │ ██████▏ │ ███████▊ │
21 │ ████████████████▌ │ ██████████████████████████████████████████████████ │ █████████████▊ │
22 │ ██████████▋ │ ██████▋ │ █████████████▏ │
23 │ ████████████▌ │ █████▌ │ █████▋ │
└───────────┴───────────────────────────────┴────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘

24 rows in set. Elapsed: 0.038 sec. Processed 266.05 thousand rows, 14.66 MB (7.09 million rows/s., 390.69 MB/s.)
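The scaling bar performs can be sketched in Python. This is a simplified approximation, assuming eighth-width block characters; ClickHouse's exact glyph selection may differ.

```python
def bar(value, min_v, max_v, width):
    """Render `value` on a [min_v, max_v] scale as up to `width` character
    cells, approximating ClickHouse's bar() function."""
    cells = max(0.0, min(1.0, (value - min_v) / (max_v - min_v))) * width
    full = int(cells)
    eighths = int((cells - full) * 8)  # fractional cell, in eighths
    partial = ' ▏▎▍▌▋▊▉'[eighths].strip()  # empty string when eighths == 0
    return '█' * full + partial

print(bar(231, 0, 400, 50))  # e.g. the commits value for hour 15 above
```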

Matrix of authors that shows which authors tend to rewrite other authors' code

The sign = -1 indicates a code deletion. We exclude punctuation and the insertion of empty lines.

SELECT
prev_author || '(a)' as add_author,
author || '(d)' as delete_author,
count() AS c
FROM git.line_changes
WHERE (sign = -1) AND (file_extension IN ('h', 'cpp')) AND (line_type NOT IN ('Punct', 'Empty')) AND (author != prev_author) AND (prev_author != '')
GROUP BY
prev_author,
author
ORDER BY c DESC
LIMIT 1 BY prev_author
LIMIT 100

┌─prev_author──────────┬─author───────────┬─────c─┐
│ Ivan                 │ Alexey Milovidov │ 18554 │
│ Alexey Arno          │ Alexey Milovidov │ 18475 │
│ Michael Kolupaev     │ Alexey Milovidov │ 14135 │
│ Alexey Milovidov     │ Nikolai Kochetov │ 13435 │
│ Andrey Mironov       │ Alexey Milovidov │ 10418 │
│ proller              │ Alexey Milovidov │  7280 │
│ Nikolai Kochetov     │ Alexey Milovidov │  6806 │
│ alexey-milovidov     │ Alexey Milovidov │  5027 │
│ Vitaliy Lyudvichenko │ Alexey Milovidov │  4390 │
│ Amos Bird            │ Ivan Lezhankin   │  3125 │
│ f1yegor              │ Alexey Milovidov │  3119 │
│ Pavel Kartavyy       │ Alexey Milovidov │  3087 │
│ Alexey Zatelepin     │ Alexey Milovidov │  2978 │
│ alesapin             │ Alexey Milovidov │  2949 │
│ Sergey Fedorov       │ Alexey Milovidov │  2727 │
│ Ivan Lezhankin       │ Alexey Milovidov │  2618 │
│ Vasily Nemkov        │ Alexey Milovidov │  2547 │
│ Alexander Tokmakov   │ Alexey Milovidov │  2493 │
│ Nikita Vasilev       │ Maksim Kita      │  2420 │
│ Anton Popov          │ Amos Bird        │  2127 │
└──────────────────────┴──────────────────┴───────┘

20 rows in set. Elapsed: 0.098 sec. Processed 7.54 million rows, 42.16 MB (76.67 million rows/s., 428.99 MB/s.)

A Sankey chart (SuperSet) allows this to be visualized nicely. Note that we increase our LIMIT BY to 3, to get the top 3 code removers for each author, to improve the variety in the visual.

Alexey clearly likes removing other people's code. Let's exclude him for a more balanced view of code removal.

Who is the highest percentage contributor per day of the week?

If we consider just the number of commits:

SELECT
day_of_week,
author,
count() AS c
FROM git.commits
GROUP BY
dayOfWeek(time) AS day_of_week,
author
ORDER BY
day_of_week ASC,
c DESC
LIMIT 1 BY day_of_week

┌─day_of_week─┬─author───────────┬────c─┐
│           1 │ Alexey Milovidov │ 2204 │
│           2 │ Alexey Milovidov │ 1588 │
│           3 │ Alexey Milovidov │ 1725 │
│           4 │ Alexey Milovidov │ 1915 │
│           5 │ Alexey Milovidov │ 1940 │
│           6 │ Alexey Milovidov │ 1851 │
│           7 │ Alexey Milovidov │ 2400 │
└─────────────┴──────────────────┴──────┘

7 rows in set. Elapsed: 0.012 sec. Processed 62.78 thousand rows, 395.47 KB (5.44 million rows/s., 34.27 MB/s.)

OK, some possible advantage here to the longest-serving contributor - our founder Alexey. Let's limit our analysis to the last year.

SELECT
day_of_week,
author,
count() AS c
FROM git.commits
WHERE time > (now() - toIntervalYear(1))
GROUP BY
dayOfWeek(time) AS day_of_week,
author
ORDER BY
day_of_week ASC,
c DESC
LIMIT 1 BY day_of_week

┌─day_of_week─┬─author───────────┬───c─┐
│           1 │ Alexey Milovidov │ 198 │
│           2 │ alesapin         │ 162 │
│           3 │ alesapin         │ 163 │
│           4 │ Azat Khuzhin     │ 166 │
│           5 │ alesapin         │ 191 │
│           6 │ Alexey Milovidov │ 179 │
│           7 │ Alexey Milovidov │ 243 │
└─────────────┴──────────────────┴─────┘

7 rows in set. Elapsed: 0.004 sec. Processed 21.82 thousand rows, 140.02 KB (4.88 million rows/s., 31.29 MB/s.)

This is still a little simple and doesn't reflect people's work.

A better metric might be who is the largest contributor each day as a fraction of the total work performed in the last year. Note that we treat the deletion and adding of code equally.

SELECT
top_author.day_of_week,
top_author.author,
top_author.author_work / all_work.total_work AS top_author_percent
FROM
(
SELECT
day_of_week,
author,
sum(lines_added) + sum(lines_deleted) AS author_work
FROM git.file_changes
WHERE time > (now() - toIntervalYear(1))
GROUP BY
author,
dayOfWeek(time) AS day_of_week
ORDER BY
day_of_week ASC,
author_work DESC
LIMIT 1 BY day_of_week
) AS top_author
INNER JOIN
(
SELECT
day_of_week,
sum(lines_added) + sum(lines_deleted) AS total_work
FROM git.file_changes
WHERE time > (now() - toIntervalYear(1))
GROUP BY dayOfWeek(time) AS day_of_week
) AS all_work USING (day_of_week)

┌─day_of_week─┬─author──────────────┬──top_author_percent─┐
│           1 │ Alexey Milovidov    │  0.3168282877768332 │
│           2 │ Mikhail f. Shiryaev │  0.3523434231193969 │
│           3 │ vdimir              │ 0.11859742484577324 │
│           4 │ Nikolay Degterinsky │ 0.34577318920318467 │
│           5 │ Alexey Milovidov    │ 0.13208704423684223 │
│           6 │ Alexey Milovidov    │ 0.18895257783624633 │
│           7 │ Robert Schulze      │  0.3617405888930302 │
└─────────────┴─────────────────────┴─────────────────────┘

7 rows in set. Elapsed: 0.014 sec. Processed 106.12 thousand rows, 1.38 MB (7.61 million rows/s., 98.65 MB/s.)

Distribution of code age in the repository

We limit the analysis to the current files. For brevity, we restrict the results to a depth of 2, with 5 files per root folder. Adjust as required.

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
concat(root, '/', sub_folder) AS folder,
round(avg(days_present)) AS avg_age_of_files,
min(days_present) AS min_age_files,
max(days_present) AS max_age_files,
count() AS c
FROM
(
SELECT
path,
dateDiff('day', min(time), toDate('2022-11-03')) AS days_present
FROM git.file_changes
WHERE (path IN (current_files)) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY path
)
GROUP BY
splitByChar('/', path)[1] AS root,
splitByChar('/', path)[2] AS sub_folder
ORDER BY
root ASC,
c DESC
LIMIT 5 BY root

┌─folder───────────────────────────┬─avg_age_of_files─┬─min_age_files─┬─max_age_files─┬────c─┐
│ base/base                        │              387 │           201 │           397 │   84 │
│ base/glibc-compatibility         │              887 │            59 │           993 │   19 │
│ base/consistent-hashing          │              993 │           993 │           993 │    5 │
│ base/widechar_width              │              993 │           993 │           993 │    2 │
│ base/consistent-hashing-sumbur   │              993 │           993 │           993 │    2 │
│ docker/test                      │             1043 │          1043 │          1043 │    1 │
│ programs/odbc-bridge             │              835 │            91 │           945 │   25 │
│ programs/copier                  │              587 │            14 │           945 │   22 │
│ programs/library-bridge          │              155 │            47 │           608 │   21 │
│ programs/disks                   │              144 │            62 │           150 │   14 │
│ programs/server                  │              874 │           709 │           945 │   10 │
│ rust/BLAKE3                      │               52 │            52 │            52 │    1 │
│ src/Functions                    │              752 │             0 │           944 │  809 │
│ src/Storages                     │              700 │             8 │           944 │  736 │
│ src/Interpreters                 │              684 │             3 │           944 │  490 │
│ src/Processors                   │              703 │            44 │           944 │  482 │
│ src/Common                       │              673 │             7 │           944 │  473 │
│ tests/queries                    │              674 │            -5 │           945 │ 3777 │
│ tests/integration                │              656 │           132 │           945 │    4 │
│ utils/memcpy-bench               │              601 │           599 │           605 │   10 │
│ utils/keeper-bench               │              570 │           569 │           570 │    7 │
│ utils/durability-test            │              793 │           793 │           793 │    4 │
│ utils/self-extracting-executable │              143 │           143 │           143 │    3 │
│ utils/self-extr-exec             │              224 │           224 │           224 │    2 │
└──────────────────────────────────┴──────────────────┴───────────────┴───────────────┴──────┘

24 rows in set. Elapsed: 0.129 sec. Processed 798.15 thousand rows, 15.11 MB (6.19 million rows/s., 117.08 MB/s.)

How much of an author's code has been removed by other authors?

For this question, we need the number of lines written by an author divided by the total number of their lines that have been removed by other contributors.

SELECT
k,
written_code.c,
removed_code.c,
removed_code.c / written_code.c AS remove_ratio
FROM
(
SELECT
author AS k,
count() AS c
FROM git.line_changes
WHERE (sign = 1) AND (file_extension IN ('h', 'cpp')) AND (line_type NOT IN ('Punct', 'Empty'))
GROUP BY k
) AS written_code
INNER JOIN
(
SELECT
prev_author AS k,
count() AS c
FROM git.line_changes
WHERE (sign = -1) AND (file_extension IN ('h', 'cpp')) AND (line_type NOT IN ('Punct', 'Empty')) AND (author != prev_author)
GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY remove_ratio DESC
LIMIT 10

┌─k──────────────────┬─────c─┬─removed_code.c─┬───────remove_ratio─┐
│ Marek Vavruša      │  1458 │           1318 │ 0.9039780521262003 │
│ Ivan               │ 32715 │          27500 │ 0.8405930001528351 │
│ artpaul            │  3450 │           2840 │ 0.8231884057971014 │
│ Silviu Caragea     │  1542 │           1209 │ 0.7840466926070039 │
│ Ruslan             │  1027 │            802 │ 0.7809152872444012 │
│ Tsarkova Anastasia │  1755 │           1364 │ 0.7772079772079772 │
│ Vyacheslav Alipov  │  3526 │           2727 │ 0.7733976176971072 │
│ Marek Vavruša      │  1467 │           1124 │ 0.7661895023858214 │
│ f1yegor            │  7194 │           5213 │ 0.7246316374756742 │
│ kreuzerkrieg       │  3406 │           2468 │  0.724603640634175 │
└────────────────────┴───────┴────────────────┴────────────────────┘

10 rows in set. Elapsed: 0.126 sec. Processed 15.07 million rows, 73.51 MB (119.97 million rows/s., 585.16 MB/s.)

Which files were rewritten the most times?

The simplest approach to this question might be to count the number of line modifications per path (restricted to current files), e.g.:

WITH current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
)
SELECT
path,
count() AS c
FROM git.line_changes
WHERE (file_extension IN ('h', 'cpp', 'sql')) AND (path IN (current_files))
GROUP BY path
ORDER BY c DESC
LIMIT 10

┌─path───────────────────────────────────────────────────┬─────c─┐
│ src/Storages/StorageReplicatedMergeTree.cpp            │ 21871 │
│ src/Storages/MergeTree/MergeTreeData.cpp               │ 17709 │
│ programs/client/Client.cpp                             │ 15882 │
│ src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp │ 14249 │
│ src/Interpreters/InterpreterSelectQuery.cpp            │ 12636 │
│ src/Parsers/ExpressionListParsers.cpp                  │ 11794 │
│ src/Analyzer/QueryAnalysisPass.cpp                     │ 11760 │
│ src/Coordination/KeeperStorage.cpp                     │ 10225 │
│ src/Functions/FunctionsConversion.h                    │  9247 │
│ src/Parsers/ExpressionElementParsers.cpp               │  8197 │
└────────────────────────────────────────────────────────┴───────┘

10 rows in set. Elapsed: 0.160 sec. Processed 8.07 million rows, 98.99 MB (50.49 million rows/s., 619.49 MB/s.)

However, this doesn't capture the notion of a "rewrite", where a large portion of the file changes in a single commit. This requires a more complex query. Here we consider a rewrite to occur when more than 50% of the file is deleted and more than 50% is added. Adjust the query to your own interpretation of what constitutes a rewrite.

The query is limited to the current files only. We list all file changes by grouping by path and commit_hash, returning the number of lines added and deleted. Using a window function, we estimate the file's total size at any moment in time by performing a cumulative sum, with the impact of any change estimated as lines added - lines deleted. From this statistic, we can calculate the percentage of the file added or deleted by each change. Finally, we count the number of file changes that constitute a rewrite per file, i.e. (percent_add >= 0.5) AND (percent_delete >= 0.5) AND current_size > 50. Note that we require files to be more than 50 lines to avoid counting early contributions to a file as rewrites. This also avoids a bias towards very small files, which may be more likely to be rewritten.

WITH
current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
),
changes AS
(
SELECT
path,
max(time) AS max_time,
commit_hash,
any(lines_added) AS num_added,
any(lines_deleted) AS num_deleted,
any(change_type) AS type
FROM git.file_changes
WHERE (change_type IN ('Add', 'Modify')) AND (path IN (current_files)) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY
path,
commit_hash
ORDER BY
path ASC,
max_time ASC
),
rewrites AS
(
SELECT
path,
commit_hash,
max_time,
type,
num_added,
num_deleted,
sum(num_added - num_deleted) OVER (PARTITION BY path ORDER BY max_time ASC) AS current_size,
if(current_size > 0, num_added / current_size, 0) AS percent_add,
if(current_size > 0, num_deleted / current_size, 0) AS percent_delete
FROM changes
)
SELECT
path,
count() AS num_rewrites
FROM rewrites
WHERE (type = 'Modify') AND (percent_add >= 0.5) AND (percent_delete >= 0.5) AND (current_size > 50)
GROUP BY path
ORDER BY num_rewrites DESC
LIMIT 10

┌─path──────────────────────────────────────────────────┬─num_rewrites─┐
│ src/Storages/WindowView/StorageWindowView.cpp         │            8 │
│ src/Functions/array/arrayIndex.h                      │            7 │
│ src/Dictionaries/CacheDictionary.cpp                  │            6 │
│ src/Dictionaries/RangeHashedDictionary.cpp            │            5 │
│ programs/client/Client.cpp                            │            4 │
│ src/Functions/polygonPerimeter.cpp                    │            4 │
│ src/Functions/polygonsEquals.cpp                      │            4 │
│ src/Functions/polygonsWithin.cpp                      │            4 │
│ src/Processors/Formats/Impl/ArrowColumnToCHColumn.cpp │            4 │
│ src/Functions/polygonsSymDifference.cpp               │            4 │
└───────────────────────────────────────────────────────┴──────────────┘

10 rows in set. Elapsed: 0.299 sec. Processed 798.15 thousand rows, 31.52 MB (2.67 million rows/s., 105.29 MB/s.)
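The core of the query above - a running file size plus the 50%/50% thresholds - can be sketched in Python. This is a simplified model of the window-function logic, not the exact query semantics; the change list and thresholds are illustrative.

```python
def count_rewrites(changes, threshold=0.5, min_size=50):
    """changes: (lines_added, lines_deleted) per commit, oldest first.
    A commit counts as a rewrite when both the added and the deleted lines
    are at least `threshold` of the file's running size, and the file has
    more than `min_size` lines (mirroring current_size > 50)."""
    size = 0
    rewrites = 0
    for added, deleted in changes:
        size += added - deleted  # cumulative sum, as in the window function
        if size > min_size and added >= threshold * size and deleted >= threshold * size:
            rewrites += 1
    return rewrites
```

Note that the initial Add commit never qualifies, since nothing is deleted - analogous to the type = 'Modify' restriction in the query.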

What day of the week does code have the highest chance of staying in the repository?

For this, we need to identify a line of code uniquely. We estimate this (as the same line may appear multiple times in a file) using the path and line contents.

We query for lines added, joining this with the lines removed - filtering to cases where the latter occurs more recently than the former. This gives us the deleted lines, from which we can compute the time between these two events.

Finally, we aggregate across this dataset to compute the average number of days lines stay in the repository by the day of the week they were added.

SELECT
day_of_week_added,
count() AS num,
avg(days_present) AS avg_days_present
FROM
(
SELECT
added_code.line,
added_code.time AS added_day,
dateDiff('day', added_code.time, removed_code.time) AS days_present
FROM
(
SELECT
path,
line,
max(time) AS time
FROM git.line_changes
WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty'))
GROUP BY
path,
line
) AS added_code
INNER JOIN
(
SELECT
path,
line,
max(time) AS time
FROM git.line_changes
WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty'))
GROUP BY
path,
line
) AS removed_code USING (path, line)
WHERE removed_code.time > added_code.time
)
GROUP BY dayOfWeek(added_day) AS day_of_week_added

┌─day_of_week_added─┬────num─┬───avg_days_present─┐
│                 1 │ 171879 │ 193.81759260875384 │
│                 2 │ 141448 │  153.0931013517335 │
│                 3 │ 161230 │ 137.61553681076722 │
│                 4 │ 255728 │ 121.14149799787273 │
│                 5 │ 203907 │ 141.60181847606998 │
│                 6 │  62305 │ 202.43449161383518 │
│                 7 │  70904 │  220.0266134491707 │
└───────────────────┴────────┴────────────────────┘

7 rows in set. Elapsed: 3.965 sec. Processed 15.07 million rows, 1.92 GB (3.80 million rows/s., 483.50 MB/s.)
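The join logic above can be sketched in Python, assuming two hypothetical dicts keyed by (path, line contents) that hold the latest add and delete dates:

```python
from datetime import date

def line_lifetimes(added, removed):
    """added / removed: {(path, line): date of the latest add / delete}.
    Keeps only lines whose delete happened after their add, and returns
    the number of days each line survived."""
    return {
        key: (removed[key] - added[key]).days
        for key in added.keys() & removed.keys()
        if removed[key] > added[key]
    }

lifetimes = line_lifetimes(
    {('a.cpp', 'x = 1;'): date(2022, 1, 1)},
    {('a.cpp', 'x = 1;'): date(2022, 1, 11)},
)
```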

Files sorted by average code age

This query uses the same principle as "What day of the week does code have the highest chance of staying in the repository?" - aiming to uniquely identify a line of code using the path and line contents. This allows us to identify the time between when a line was added and removed. We filter to current files and code only, however, and average the time across lines for each file.

WITH
current_files AS
(
SELECT path
FROM
(
SELECT
old_path AS path,
max(time) AS last_time,
2 AS change_type
FROM git.file_changes
GROUP BY old_path
UNION ALL
SELECT
path,
max(time) AS last_time,
argMax(change_type, time) AS change_type
FROM git.file_changes
GROUP BY path
)
GROUP BY path
HAVING (argMax(change_type, last_time) != 2) AND (NOT match(path, '(^dbms/)|(^libs/)|(^tests/testflows/)|(^programs/server/store/)'))
ORDER BY path ASC
),
lines_removed AS
(
SELECT
added_code.path AS path,
added_code.line,
added_code.time AS added_day,
dateDiff('day', added_code.time, removed_code.time) AS days_present
FROM
(
SELECT
path,
line,
max(time) AS time,
any(file_extension) AS file_extension
FROM git.line_changes
WHERE (sign = 1) AND (line_type NOT IN ('Punct', 'Empty'))
GROUP BY
path,
line
) AS added_code
INNER JOIN
(
SELECT
path,
line,
max(time) AS time
FROM git.line_changes
WHERE (sign = -1) AND (line_type NOT IN ('Punct', 'Empty'))
GROUP BY
path,
line
) AS removed_code USING (path, line)
WHERE (removed_code.time > added_code.time) AND (path IN (current_files)) AND (file_extension IN ('h', 'cpp', 'sql'))
)
SELECT
path,
avg(days_present) AS avg_code_age
FROM lines_removed
GROUP BY path
ORDER BY avg_code_age DESC
LIMIT 10

┌─path────────────────────────────────────────────────────────────┬──────avg_code_age─┐
│ utils/corrector_utf8/corrector_utf8.cpp                         │ 1353.888888888889 │
│ tests/queries/0_stateless/01288_shard_max_network_bandwidth.sql │               881 │
│ src/Functions/replaceRegexpOne.cpp                              │               861 │
│ src/Functions/replaceRegexpAll.cpp                              │               861 │
│ src/Functions/replaceOne.cpp                                    │               861 │
│ utils/zookeeper-remove-by-list/main.cpp                         │            838.25 │
│ tests/queries/0_stateless/01356_state_resample.sql              │               819 │
│ tests/queries/0_stateless/01293_create_role.sql                 │               819 │
│ src/Functions/ReplaceStringImpl.h                               │               810 │
│ src/Interpreters/createBlockSelector.cpp                        │               795 │
└─────────────────────────────────────────────────────────────────┴───────────────────┘

10 rows in set. Elapsed: 3.134 sec. Processed 16.13 million rows, 1.83 GB (5.15 million rows/s., 582.99 MB/s.)

Who tends to write more tests / CPP code / comments?

There are a few ways we can address this question. Focusing on the code-to-test ratio, this query is relatively simple - count the number of contributions to folders containing tests and compute the ratio to total contributions.

Note that we limit to users with more than 20 changes, to focus on regular committers and avoid a bias towards one-off contributions.

SELECT
author,
countIf((file_extension IN ('h', 'cpp', 'sql', 'sh', 'py', 'expect')) AND (path LIKE '%tests%')) AS test,
countIf((file_extension IN ('h', 'cpp', 'sql')) AND (NOT (path LIKE '%tests%'))) AS code,
code / (code + test) AS ratio_code
FROM git.file_changes
GROUP BY author
HAVING code > 20
ORDER BY code DESC
LIMIT 20

┌─author───────────────┬─test─┬──code─┬─────────ratio_code─┐
│ Alexey Milovidov     │ 6617 │ 41799 │ 0.8633303040317251 │
│ Nikolai Kochetov     │  916 │ 13361 │ 0.9358408629263851 │
│ alesapin             │ 2408 │  8796 │  0.785076758300607 │
│ kssenii              │  869 │  6769 │ 0.8862267609321812 │
│ Maksim Kita          │  799 │  5862 │ 0.8800480408347096 │
│ Alexander Tokmakov   │ 1472 │  5727 │ 0.7955271565495208 │
│ Vitaly Baranov       │ 1764 │  5521 │ 0.7578586135895676 │
│ Ivan Lezhankin       │  843 │  4698 │ 0.8478613968597726 │
│ Anton Popov          │  599 │  4346 │ 0.8788675429726996 │
│ Ivan                 │ 2630 │  4269 │ 0.6187853312074214 │
│ Azat Khuzhin         │ 1664 │  3697 │  0.689610147360567 │
│ Amos Bird            │  400 │  2901 │ 0.8788245986064829 │
│ proller              │ 1207 │  2377 │ 0.6632254464285714 │
│ chertus              │  453 │  2359 │ 0.8389046941678521 │
│ alexey-milovidov     │  303 │  2321 │ 0.8845274390243902 │
│ Alexey Arno          │  169 │  2310 │ 0.9318273497377975 │
│ Vitaliy Lyudvichenko │  334 │  2283 │ 0.8723729461215132 │
│ Robert Schulze       │  182 │  2196 │ 0.9234650967199327 │
│ CurtizJ              │  460 │  2158 │ 0.8242933537051184 │
│ Alexander Kuzmenkov  │  298 │  2092 │ 0.8753138075313808 │

20 rows in set. Elapsed: 0.034 sec. Processed 266.05 thousand rows, 4.65 MB (7.93 million rows/s., 138.76 MB/s.)

We can plot this distribution as a histogram.

WITH (
SELECT histogram(10)(ratio_code) AS hist
FROM
(
SELECT
author,
countIf((file_extension IN ('h', 'cpp', 'sql', 'sh', 'py', 'expect')) AND (path LIKE '%tests%')) AS test,
countIf((file_extension IN ('h', 'cpp', 'sql')) AND (NOT (path LIKE '%tests%'))) AS code,
code / (code + test) AS ratio_code
FROM git.file_changes
GROUP BY author
HAVING code > 20
ORDER BY code DESC
LIMIT 20
)
) AS hist
SELECT
arrayJoin(hist).1 AS lower,
arrayJoin(hist).2 AS upper,
bar(arrayJoin(hist).3, 0, 100, 500) AS bar

┌──────────────lower─┬──────────────upper─┬─bar───────────────────────────┐
│ 0.6187853312074214 │ 0.6410053888179964 │ █████                         │
│ 0.6410053888179964 │ 0.6764177968945693 │ █████                         │
│ 0.6764177968945693 │ 0.7237343804750673 │ █████                         │
│ 0.7237343804750673 │ 0.7740802855073157 │ █████▋                        │
│ 0.7740802855073157 │  0.807297655565091 │ ████████▋                     │
│  0.807297655565091 │ 0.8338381996094653 │ ██████▎                       │
│ 0.8338381996094653 │ 0.8533566747727687 │ ████████▋                     │
│ 0.8533566747727687 │  0.871392376017531 │ █████████▍                    │
│  0.871392376017531 │  0.904916108899021 │ ████████████████████████████▋ │
│  0.904916108899021 │ 0.9358408629263851 │ █████████████████▌            │
└────────────────────┴────────────────────┴───────────────────────────────┘
10 rows in set. Elapsed: 0.051 sec. Processed 266.05 thousand rows, 4.65 MB (5.24 million rows/s., 91.64 MB/s.)

As expected, most contributors write more code than tests.

What about who adds the most comments when contributing code?

SELECT
author,
avg(ratio_comments) AS avg_ratio_comments,
sum(code) AS code
FROM
(
SELECT
author,
commit_hash,
countIf(line_type = 'Comment') AS comments,
countIf(line_type = 'Code') AS code,
if(comments > 0, comments / (comments + code), 0) AS ratio_comments
FROM git.line_changes
GROUP BY
author,
commit_hash
)
GROUP BY author
ORDER BY code DESC
LIMIT 10
┌─author─────────────┬──avg_ratio_comments─┬────code─┐
│ Alexey Milovidov   │  0.1034915408309902 │ 1147196 │
│ s-kat              │  0.1361718900215362 │  614224 │
│ Nikolai Kochetov   │ 0.08722993407690126 │  218328 │
│ alesapin           │  0.1040477684726504 │  198082 │
│ Vitaly Baranov     │ 0.06446875712939285 │  161801 │
│ Maksim Kita        │ 0.06863376297549255 │  156381 │
│ Alexey Arno        │ 0.11252677608033655 │  146642 │
│ Vitaliy Zakaznikov │ 0.06199215397180561 │  138530 │
│ kssenii            │ 0.07455322590796751 │  131143 │
│ Artur              │ 0.12383737231074826 │  121484 │
└────────────────────┴─────────────────────┴─────────┘
10 rows in set. Elapsed: 0.290 sec. Processed 7.54 million rows, 394.57 MB (26.00 million rows/s., 1.36 GB/s.)

Note that we sort by code contribution. The ratios are surprisingly high for all of our largest contributors, which is part of what keeps our code readable.

How do an author's commits change over time with respect to code/comment percentage?

Computing this by author is trivial:

SELECT
author,
countIf(line_type = 'Code') AS code_lines,
countIf((line_type = 'Comment') OR (line_type = 'Punct')) AS comments,
code_lines / (comments + code_lines) AS ratio_code,
toStartOfWeek(time) AS week
FROM git.line_changes
GROUP BY
time,
author
ORDER BY
author ASC,
time ASC
LIMIT 10

┌─author──────────────────────┬─code_lines─┬─comments─┬─────────ratio_code─┬───────week─┐
│ 1lann                       │          8 │        0 │                  1 │ 2022-03-06 │
│ 20018712                    │          2 │        0 │                  1 │ 2020-09-13 │
│ 243f6a8885a308d313198a2e037 │          0 │        2 │                  0 │ 2020-12-06 │
│ 243f6a8885a308d313198a2e037 │          0 │      112 │                  0 │ 2020-12-06 │
│ 243f6a8885a308d313198a2e037 │          0 │       14 │                  0 │ 2020-12-06 │
│ 3ldar-nasyrov               │          2 │        0 │                  1 │ 2021-03-14 │
│ 821008736@qq.com            │         27 │        2 │ 0.9310344827586207 │ 2019-04-21 │
│ ANDREI STAROVEROV           │        182 │       60 │ 0.7520661157024794 │ 2021-05-09 │
│ ANDREI STAROVEROV           │          7 │        0 │                  1 │ 2021-05-09 │
│ ANDREI STAROVEROV           │         32 │       12 │ 0.7272727272727273 │ 2021-05-09 │
└─────────────────────────────┴────────────┴──────────┴────────────────────┴────────────┘

10 rows in set. Elapsed: 0.145 sec. Processed 7.54 million rows, 51.09 MB (51.83 million rows/s., 351.44 MB/s.)

Ideally, however, we want to see how this changes in aggregate across all authors from the first day they start committing. Do they slowly reduce the number of comments they write?

To compute this, we first work out each author's comment ratio over time - similar to "Who tends to write more tests / CPP code / comments?". This is joined against each author's start date, allowing us to calculate the comment ratio by week offset.

After calculating the average week offset across all authors, we sample these results by selecting every 10th week.

WITH author_ratios_by_offset AS
(
SELECT
author,
dateDiff('week', start_dates.start_date, contributions.week) AS week_offset,
ratio_code
FROM
(
SELECT
author,
toStartOfWeek(min(time)) AS start_date
FROM git.line_changes
WHERE file_extension IN ('h', 'cpp', 'sql')
GROUP BY author
) AS start_dates
INNER JOIN
(
SELECT
author,
countIf(line_type = 'Code') AS code,
countIf((line_type = 'Comment') OR (line_type = 'Punct')) AS comments,
comments / (comments + code) AS ratio_code,
toStartOfWeek(time) AS week
FROM git.line_changes
WHERE (file_extension IN ('h', 'cpp', 'sql')) AND (sign = 1)
GROUP BY
time,
author
HAVING code > 20
ORDER BY
author ASC,
time ASC
) AS contributions USING (author)
)
SELECT
week_offset,
avg(ratio_code) AS avg_code_ratio
FROM author_ratios_by_offset
GROUP BY week_offset
HAVING (week_offset % 10) = 0
ORDER BY week_offset ASC
LIMIT 20

┌─week_offset─┬──────avg_code_ratio─┐
│           0 │ 0.21626798253005078 │
│          10 │ 0.18299433892099454 │
│          20 │ 0.22847255749045017 │
│          30 │  0.2037816688365288 │
│          40 │  0.1987063517030308 │
│          50 │ 0.17341406302829748 │
│          60 │  0.1808884776496144 │
│          70 │ 0.18711773536450496 │
│          80 │ 0.18905573684766458 │
│          90 │  0.2505147771581594 │
│         100 │  0.2427673990917429 │
│         110 │ 0.19088569009169926 │
│         120 │ 0.14218574654598348 │
│         130 │ 0.20894252550489317 │
│         140 │ 0.22316626978848397 │
│         150 │  0.1859507592277053 │
│         160 │ 0.22007759757363546 │
│         170 │ 0.20406936638195144 │
│         180 │  0.1412102467834332 │
│         190 │ 0.20677550885049117 │
└─────────────┴─────────────────────┘

20 rows in set. Elapsed: 0.167 sec. Processed 15.07 million rows, 101.74 MB (90.51 million rows/s., 610.98 MB/s.)

Encouragingly, our comment percentage is fairly constant and doesn't degrade the longer authors contribute.

What is the average time before code is rewritten, and the median (half-life of code decay)?

We can use the same principle as "Which files were rewritten the most times?" to identify rewrites, but consider all files. A window function is used to compute the time between rewrites for each file. From this, we can calculate an average and median across all files.

WITH
changes AS
(
SELECT
path,
commit_hash,
max_time,
type,
num_added,
num_deleted,
sum(num_added - num_deleted) OVER (PARTITION BY path ORDER BY max_time ASC) AS current_size,
if(current_size > 0, num_added / current_size, 0) AS percent_add,
if(current_size > 0, num_deleted / current_size, 0) AS percent_delete
FROM
(
SELECT
path,
max(time) AS max_time,
commit_hash,
any(lines_added) AS num_added,
any(lines_deleted) AS num_deleted,
any(change_type) AS type
FROM git.file_changes
WHERE (change_type IN ('Add', 'Modify')) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY
path,
commit_hash
ORDER BY
path ASC,
max_time ASC
)
),
rewrites AS
(
SELECT
*,
any(max_time) OVER (PARTITION BY path ORDER BY max_time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS previous_rewrite,
dateDiff('day', previous_rewrite, max_time) AS rewrite_days
FROM changes
WHERE (type = 'Modify') AND (percent_add >= 0.5) AND (percent_delete >= 0.5) AND (current_size > 50)
)
SELECT
avgIf(rewrite_days, rewrite_days > 0) AS avg_rewrite_time,
quantilesTimingIf(0.5)(rewrite_days, rewrite_days > 0) AS half_life
FROM rewrites

┌─avg_rewrite_time─┬─half_life─┐
│      122.2890625 │      [23] │
└──────────────────┴───────────┘

1 row in set. Elapsed: 0.388 sec. Processed 266.05 thousand rows, 22.85 MB (685.82 thousand rows/s., 58.89 MB/s.)
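The final aggregation can be sketched in Python, assuming a hypothetical list of day gaps between successive rewrites (a file's first recorded rewrite yields a gap of 0 and is excluded, as in the rewrite_days > 0 condition):

```python
from statistics import mean, median

def rewrite_half_life(rewrite_days):
    """Average and median days between successive rewrites of a file,
    ignoring zero gaps (each file's first recorded rewrite)."""
    gaps = [d for d in rewrite_days if d > 0]
    return mean(gaps), median(gaps)
```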

What is the worst time to write code, in the sense that it has the highest chance of being rewritten?

Similar to "What is the average time before code is rewritten, and the median (half-life of code decay)?" and "Which files were rewritten the most times?", except we aggregate by day of week. Adjust as required, e.g. by month of year.

WITH
changes AS
(
SELECT
path,
commit_hash,
max_time,
type,
num_added,
num_deleted,
sum(num_added - num_deleted) OVER (PARTITION BY path ORDER BY max_time ASC) AS current_size,
if(current_size > 0, num_added / current_size, 0) AS percent_add,
if(current_size > 0, num_deleted / current_size, 0) AS percent_delete
FROM
(
SELECT
path,
max(time) AS max_time,
commit_hash,
any(file_lines_added) AS num_added,
any(file_lines_deleted) AS num_deleted,
any(file_change_type) AS type
FROM git.line_changes
WHERE (file_change_type IN ('Add', 'Modify')) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY
path,
commit_hash
ORDER BY
path ASC,
max_time ASC
)
),
rewrites AS
(
SELECT any(max_time) OVER (PARTITION BY path ORDER BY max_time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS previous_rewrite
FROM changes
WHERE (type = 'Modify') AND (percent_add >= 0.5) AND (percent_delete >= 0.5) AND (current_size > 50)
)
SELECT
dayOfWeek(previous_rewrite) AS dayOfWeek,
count() AS num_re_writes
FROM rewrites
GROUP BY dayOfWeek

┌─dayOfWeek─┬─num_re_writes─┐
│         1 │           111 │
│         2 │           121 │
│         3 │            91 │
│         4 │           111 │
│         5 │            90 │
│         6 │            64 │
│         7 │            46 │
└───────────┴───────────────┘

7 rows in set. Elapsed: 0.466 sec. Processed 7.54 million rows, 701.52 MB (16.15 million rows/s., 1.50 GB/s.)

Which authors' code is the most sticky?

We define "sticky" as how long an author's code stays before it is rewritten. Similar to the previous question "What is the average time before code is rewritten, and the median (half-life of code decay)?" - using the same metric for rewrites, i.e. 50% additions and 50% deletions to a file. We compute the average rewrite time per author and only consider contributors with more than two files.

WITH
changes AS
(
SELECT
path,
author,
commit_hash,
max_time,
type,
num_added,
num_deleted,
sum(num_added - num_deleted) OVER (PARTITION BY path ORDER BY max_time ASC) AS current_size,
if(current_size > 0, num_added / current_size, 0) AS percent_add,
if(current_size > 0, num_deleted / current_size, 0) AS percent_delete
FROM
(
SELECT
path,
any(author) AS author,
max(time) AS max_time,
commit_hash,
any(file_lines_added) AS num_added,
any(file_lines_deleted) AS num_deleted,
any(file_change_type) AS type
FROM git.line_changes
WHERE (file_change_type IN ('Add', 'Modify')) AND (file_extension IN ('h', 'cpp', 'sql'))
GROUP BY
path,
commit_hash
ORDER BY
path ASC,
max_time ASC
)
),
rewrites AS
(
SELECT
*,
any(max_time) OVER (PARTITION BY path ORDER BY max_time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS previous_rewrite,
dateDiff('day', previous_rewrite, max_time) AS rewrite_days,
any(author) OVER (PARTITION BY path ORDER BY max_time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS prev_author
FROM changes
WHERE (type = 'Modify') AND (percent_add >= 0.5) AND (percent_delete >= 0.5) AND (current_size > 50)
)
SELECT
prev_author,
avg(rewrite_days) AS c,
uniq(path) AS num_files
FROM rewrites
GROUP BY prev_author
HAVING num_files > 2
ORDER BY c DESC
LIMIT 10

┌─prev_author─────────┬──────────────────c─┬─num_files─┐
│ Michael Kolupaev    │              304.6 │         4 │
│ alexey-milovidov    │  81.83333333333333 │         4 │
│ Alexander Kuzmenkov │               64.5 │         5 │
│ Pavel Kruglov       │               55.8 │         6 │
│ Alexey Milovidov    │ 48.416666666666664 │        90 │
│ Amos Bird           │               42.8 │         4 │
│ alesapin            │ 38.083333333333336 │        12 │
│ Nikolai Kochetov    │  33.18421052631579 │        26 │
│ Alexander Tokmakov  │ 31.866666666666667 │        12 │
│ Alexey Zatelepin    │               22.5 │         4 │
└─────────────────────┴────────────────────┴───────────┘

10 rows in set. Elapsed: 0.555 sec. Processed 7.54 million rows, 720.60 MB (13.58 million rows/s., 1.30 GB/s.)

作者连续提交的天数最多

此查询首先需要我们计算作者提交的日期。使用窗口函数,按作者进行分区,我们可以计算他们提交之间的日期。对于每次提交,如果自上次提交以来的时间为 1 天,则将其标记为连续 (1),否则为 0——将此结果存储在consecutive_day中。

我们随后的数组函数计算每个作者最长的连续 1 的序列。首先,groupArray函数用于整理作者的所有consecutive_day值。然后,这个由 1 和 0 组成的数组将根据 0 值拆分为子数组。最后,我们计算最长的子数组。


WITH commit_days AS
(
SELECT
author,
day,
any(day) OVER (PARTITION BY author ORDER BY day ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS previous_commit,
dateDiff('day', previous_commit, day) AS days_since_last,
if(days_since_last = 1, 1, 0) AS consecutive_day
FROM
(
SELECT
author,
toStartOfDay(time) AS day
FROM git.commits
GROUP BY
author,
day
ORDER BY
author ASC,
day ASC
)
)
SELECT
author,
arrayMax(arrayMap(x -> length(x), arraySplit(x -> (x = 0), groupArray(consecutive_day)))) - 1 AS max_consecutive_days
FROM commit_days
GROUP BY author
ORDER BY max_consecutive_days DESC
LIMIT 10

┌─author───────────┬─max_consecutive_days─┐
│ kssenii          │                   32 │
│ Alexey Milovidov │                   30 │
│ alesapin         │                   26 │
│ Azat Khuzhin     │                   23 │
│ Nikolai Kochetov │                   15 │
│ feng lv          │                   11 │
│ alexey-milovidov │                   11 │
│ Igor Nikonov     │                   11 │
│ Maksim Kita      │                   11 │
│ Nikita Vasilev   │                   11 │
└──────────────────┴──────────────────────┘

10 rows in set. Elapsed: 0.025 sec. Processed 62.78 thousand rows, 395.47 KB (2.54 million rows/s., 16.02 MB/s.)

Line by line commit history of a file

Files can be renamed. When this occurs, we get a rename event, where the path column is set to the new path of the file and old_path represents the previous location, e.g.


SELECT
time,
path,
old_path,
commit_hash,
commit_message
FROM git.file_changes
WHERE (path = 'src/Storages/StorageReplicatedMergeTree.cpp') AND (change_type = 'Rename')

┌────────────────time─┬─path────────────────────────────────────────┬─old_path─────────────────────────────────────┬─commit_hash──────────────────────────────┬─commit_message─┐
│ 2020-04-03 16:14:31 │ src/Storages/StorageReplicatedMergeTree.cpp │ dbms/Storages/StorageReplicatedMergeTree.cpp │ 06446b4f08a142d6f1bc30664c47ded88ab51782 │ dbms/ → src/   │
└─────────────────────┴─────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────────────────┴────────────────┘

1 row in set. Elapsed: 0.135 sec. Processed 266.05 thousand rows, 20.73 MB (1.98 million rows/s., 154.04 MB/s.)

This makes viewing the full history of a file challenging, since we have no single value connecting all of its line or file changes.

To address this, we can use User Defined Functions (UDFs). These cannot currently be recursive, so to identify the history of a file we must define a series of UDFs that call each other explicitly.

This means we can only track renames to a maximum depth; the example below goes 5 levels deep. It is unlikely a file will be renamed more times than this, so for now this is sufficient.

CREATE FUNCTION file_path_history AS (n) -> if(empty(n),  [], arrayConcat([n], file_path_history_01((SELECT if(empty(old_path), Null, old_path) FROM git.file_changes WHERE path = n AND (change_type = 'Rename' OR change_type = 'Add') LIMIT 1))));
CREATE FUNCTION file_path_history_01 AS (n) -> if(isNull(n), [], arrayConcat([n], file_path_history_02((SELECT if(empty(old_path), Null, old_path) FROM git.file_changes WHERE path = n AND (change_type = 'Rename' OR change_type = 'Add') LIMIT 1))));
CREATE FUNCTION file_path_history_02 AS (n) -> if(isNull(n), [], arrayConcat([n], file_path_history_03((SELECT if(empty(old_path), Null, old_path) FROM git.file_changes WHERE path = n AND (change_type = 'Rename' OR change_type = 'Add') LIMIT 1))));
CREATE FUNCTION file_path_history_03 AS (n) -> if(isNull(n), [], arrayConcat([n], file_path_history_04((SELECT if(empty(old_path), Null, old_path) FROM git.file_changes WHERE path = n AND (change_type = 'Rename' OR change_type = 'Add') LIMIT 1))));
CREATE FUNCTION file_path_history_04 AS (n) -> if(isNull(n), [], arrayConcat([n], file_path_history_05((SELECT if(empty(old_path), Null, old_path) FROM git.file_changes WHERE path = n AND (change_type = 'Rename' OR change_type = 'Add') LIMIT 1))));
CREATE FUNCTION file_path_history_05 AS (n) -> if(isNull(n), [], [n]);

By calling file_path_history('src/Storages/StorageReplicatedMergeTree.cpp'), we recurse through the rename history, with each function invoking the next level with the old_path. The results are combined using arrayConcat.
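For illustration, the recursion these chained UDFs unroll by hand is simply an iterative walk over rename events. A Python sketch of the same idea, where the file_changes list of (path, old_path, change_type) tuples is a hypothetical miniature of the git.file_changes table:

```python
def file_path_history(path, file_changes, max_depth=5):
    """Follow old_path links through Rename/Add events,
    mirroring the depth-limited file_path_history UDF chain."""
    paths = []
    for _ in range(max_depth):
        if not path:  # empty old_path marks the end of the chain
            break
        paths.append(path)
        # the first Rename/Add row for this path supplies the previous name
        path = next((old for p, old, ct in file_changes
                     if p == path and ct in ('Rename', 'Add')), None)
    return paths

# hypothetical miniature of git.file_changes
changes = [
    ('src/Storages/StorageReplicatedMergeTree.cpp',
     'dbms/Storages/StorageReplicatedMergeTree.cpp', 'Rename'),
    ('dbms/Storages/StorageReplicatedMergeTree.cpp',
     'dbms/src/Storages/StorageReplicatedMergeTree.cpp', 'Rename'),
    ('dbms/src/Storages/StorageReplicatedMergeTree.cpp', '', 'Add'),
]
print(file_path_history('src/Storages/StorageReplicatedMergeTree.cpp', changes))
```

The max_depth parameter plays the role of the fixed five-function chain: each hop corresponds to one numbered UDF.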

For example:


SELECT file_path_history('src/Storages/StorageReplicatedMergeTree.cpp') AS paths

┌─paths─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ['src/Storages/StorageReplicatedMergeTree.cpp','dbms/Storages/StorageReplicatedMergeTree.cpp','dbms/src/Storages/StorageReplicatedMergeTree.cpp'] │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.074 sec. Processed 344.06 thousand rows, 6.27 MB (4.65 million rows/s., 84.71 MB/s.)

We can now use this capability to assemble the commits for the entire history of a file. In this example, we show one commit for each of the path values.


SELECT
time,
substring(commit_hash, 1, 11) AS commit,
change_type,
author,
path,
commit_message
FROM git.file_changes
WHERE path IN file_path_history('src/Storages/StorageReplicatedMergeTree.cpp')
ORDER BY time DESC
LIMIT 1 BY path
FORMAT PrettyCompactMonoBlock

┌────────────────time─┬─commit──────┬─change_type─┬─author─────────────┬─path─────────────────────────────────────────────┬─commit_message──────────────────────────────────────────────────────────────────┐
│ 2022-10-30 16:30:51 │ c68ab231f91 │ Modify      │ Alexander Tokmakov │ src/Storages/StorageReplicatedMergeTree.cpp      │ fix accessing part in Deleting state                                            │
│ 2020-04-03 15:21:24 │ 38a50f44d34 │ Modify      │ alesapin           │ dbms/Storages/StorageReplicatedMergeTree.cpp     │ Remove empty line                                                               │
│ 2020-04-01 19:21:27 │ 1d5a77c1132 │ Modify      │ alesapin           │ dbms/src/Storages/StorageReplicatedMergeTree.cpp │ Tried to add ability to rename primary key columns but just banned this ability │
└─────────────────────┴─────────────┴─────────────┴────────────────────┴──────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 0.170 sec. Processed 611.53 thousand rows, 41.76 MB (3.60 million rows/s., 246.07 MB/s.)

Unsolved Questions

Git blame

An exact result is currently difficult to obtain, due to the inability to hold state in array functions. This will become possible with an arrayFold or arrayReduce, which allows state to be held on each iteration.
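The state such a fold would carry can be sketched in Python: replay an ordered stream of line changes while keeping a per-line (author, text) state for the file. The (sign, line_number, author, text) tuple format and the sample history are hypothetical, loosely modeled on git.line_changes:

```python
def blame(line_changes):
    """Replay ordered line changes, carrying the per-line state
    an arrayFold-based blame would hold between iterations."""
    lines = []  # current file contents: list of (author, text)
    for sign, line_number, author, text in line_changes:
        idx = line_number - 1
        if sign > 0:          # line added at this position; later lines shift down
            lines.insert(idx, (author, text))
        else:                 # line deleted from this position; later lines shift up
            del lines[idx]
    return lines

# hypothetical change stream: two authors, one insertion, one deletion
history = [
    (1, 1, 'alice', '#include <a.h>'),
    (1, 2, 'alice', '#include <b.h>'),
    (1, 2, 'bob',   '#include <c.h>'),  # inserted between the first two lines
    (-1, 3, 'bob',  ''),                # deletes the line that is now third
]
print(blame(history))
```

Each step depends on the line positions produced by all preceding steps, which is exactly the state the GROUP BY approximation below cannot model.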

An approximate solution, sufficient for a high-level analysis, may look something like this:


SELECT
line_number_new,
argMax(author, time),
argMax(line, time)
FROM git.line_changes
WHERE path IN file_path_history('src/Storages/StorageReplicatedMergeTree.cpp')
GROUP BY line_number_new
ORDER BY line_number_new ASC
LIMIT 20

┌─line_number_new─┬─argMax(author, time)─┬─argMax(line, time)────────────────────────────────────────────┐
│               1 │ Alexey Milovidov     │ #include <Disks/DiskSpaceMonitor.h>                           │
│               2 │ s-kat                │ #include <Common/FieldVisitors.h>                             │
│               3 │ Anton Popov          │ #include <cstddef>                                            │
│               4 │ Alexander Burmak     │ #include <Common/typeid_cast.h>                               │
│               5 │ avogar               │ #include <Common/ThreadPool.h>                                │
│               6 │ Alexander Burmak     │ #include <Common/DiskSpaceMonitor.h>                          │
│               7 │ Alexander Burmak     │ #include <Common/ZooKeeper/Types.h>                           │
│               8 │ Alexander Burmak     │ #include <Common/escapeForFileName.h>                         │
│               9 │ Alexander Burmak     │ #include <Common/formatReadable.h>                            │
│              10 │ Alexander Burmak     │ #include <Common/thread_local_rng.h>                          │
│              11 │ Alexander Burmak     │ #include <Common/typeid_cast.h>                               │
│              12 │ Nikolai Kochetov     │ #include <Storages/MergeTree/DataPartStorageOnDisk.h>         │
│              13 │ alesapin             │ #include <Disks/ObjectStorages/IMetadataStorage.h>            │
│              14 │ alesapin             │                                                               │
│              15 │ Alexey Milovidov     │ #include <DB/Databases/IDatabase.h>                           │
│              16 │ Alexey Zatelepin     │ #include <Storages/MergeTree/ReplicatedMergeTreePartHeader.h> │
│              17 │ CurtizJ              │ #include <Storages/MergeTree/MergeTreeDataPart.h>             │
│              18 │ Kirill Shvakov       │ #include <Parsers/ASTDropQuery.h>                             │
│              19 │ s-kat                │ #include <Storages/MergeTree/PinnedPartUUIDs.h>               │
│              20 │ Nikita Mikhaylov     │ #include <Storages/MergeTree/MergeMutateExecutor.h>           │
└─────────────────┴──────────────────────┴───────────────────────────────────────────────────────────────┘
20 rows in set. Elapsed: 0.547 sec. Processed 7.88 million rows, 679.20 MB (14.42 million rows/s., 1.24 GB/s.)

We welcome exact and improved solutions here.