跳至主要内容

窗口函数

窗口函数允许您对与当前行相关的行集执行计算。您可以执行的一些计算类似于使用聚合函数可以执行的计算,但窗口函数不会导致行被分组到单个输出中 - 各个行仍将被返回。

标准窗口函数

ClickHouse 支持用于定义窗口和窗口函数的标准语法。下表指示当前是否支持某个功能。

功能支持?
临时窗口规范 (count(*) over (partition by id order by time desc))
包含窗口函数的表达式,例如 (count(*) over ()) / 2)
WINDOW 子句 (select ... from table window w as (partition by id))
ROWS 框架
RANGE 框架✅ (默认)
INTERVAL 用于 DateTime RANGE OFFSET 框架的语法❌ (改为指定秒数 (RANGE 支持任何数字类型)。)
GROUPS 框架
在框架上计算聚合函数 (sum(value) over (order by time))✅ (所有聚合函数都受支持)
rank()dense_rank()row_number()
别名:denseRank()
percent_rank()✅ 有效地计算数据集分区中值的相对排名。此函数有效地替代了更冗长且计算量更大的手动 SQL 计算,表示为 ifNull((rank() OVER(PARTITION BY x ORDER BY y) - 1) / nullif(count(1) OVER(PARTITION BY x) - 1, 0), 0)
别名:percentRank()
lag/lead(value, offset)
您可以使用以下一种解决方法
1) any(value) over (.... rows between <offset> preceding and <offset> preceding),或 following 用于 lead
2) lagInFrame/leadInFrame,它们类似,但会尊重窗口框架。要获得与 lag/lead 相同的行为,请使用 rows between unbounded preceding and unbounded following
ntile(buckets)
指定类似的窗口,(partition by x order by y rows between unbounded preceding and unrounded following)。

ClickHouse 特定窗口函数

还有以下 ClickHouse 特定窗口函数

nonNegativeDerivative(metric_column, timestamp_column[, INTERVAL X UNITS])

根据 timestamp_column 查找给定 metric_column 的非负导数。可以省略 INTERVAL,默认值为 INTERVAL 1 SECOND。对于每一行,计算出的值如下

  • 0 对于第 1 行,
  • metricimetrici1timestampitimestampi1interval{\text{metric}_i - \text{metric}_{i-1} \over \text{timestamp}_i - \text{timestamp}_{i-1}} * \text{interval} 对于 ithi_{th} 行。

语法

aggregate_function (column_name)
OVER ([[PARTITION BY grouping_column] [ORDER BY sorting_column]
[ROWS or RANGE expression_to_bound_rows_withing_the_group]] | [window_name])
FROM table_name
WINDOW window_name as ([[PARTITION BY grouping_column] [ORDER BY sorting_column])
  • PARTITION BY - 定义如何将结果集分成组。
  • ORDER BY - 定义在计算 aggregate_function 期间如何对组内的行进行排序。
  • ROWS or RANGE - 定义框架的边界,aggregate_function 在框架内计算。
  • WINDOW - 允许多个表达式使用相同的窗口定义。
      PARTITION
┌─────────────────┐ <-- UNBOUNDED PRECEDING (BEGINNING of the PARTITION)
│ │
│ │
│=================│ <-- N PRECEDING <─┐
│ N ROWS │ │ F
│ Before CURRENT │ │ R
│~~~~~~~~~~~~~~~~~│ <-- CURRENT ROW │ A
│ M ROWS │ │ M
│ After CURRENT │ │ E
│=================│ <-- M FOLLOWING <─┘
│ │
│ │
└─────────────────┘ <--- UNBOUNDED FOLLOWING (END of the PARTITION)

函数

这些函数只能用作窗口函数。

  • row_number() - 从 1 开始对分区内的当前行进行编号。
  • first_value(x) - 返回在其排序框架内计算的第一个值。
  • last_value(x) - 返回在其排序框架内计算的最后一个值。
  • nth_value(x, offset) - 返回在其排序框架内针对第 n 行 (偏移量) 计算的第一个非 NULL 值。
  • rank() - 对分区内的当前行进行排名,并带间隙。
  • dense_rank() - 对分区内的当前行进行排名,不带间隙。
  • lagInFrame(x) - 返回在其排序框架内在当前行之前指定的物理偏移行处计算的值。
  • leadInFrame(x) - 返回在其排序框架内在当前行之后偏移行处计算的值。

示例

让我们看看如何使用窗口函数的一些示例。

对行进行编号

CREATE TABLE salaries
(
`team` String,
`player` String,
`salary` UInt32,
`position` String
)
Engine = Memory;

INSERT INTO salaries FORMAT Values
('Port Elizabeth Barbarians', 'Gary Chen', 195000, 'F'),
('New Coreystad Archdukes', 'Charles Juarez', 190000, 'F'),
('Port Elizabeth Barbarians', 'Michael Stanley', 150000, 'D'),
('New Coreystad Archdukes', 'Scott Harrison', 150000, 'D'),
('Port Elizabeth Barbarians', 'Robert George', 195000, 'M');
SELECT player, salary, 
row_number() OVER (ORDER BY salary) AS row
FROM salaries;
┌─player──────────┬─salary─┬─row─┐
│ Michael Stanley │ 150000 │ 1 │
│ Scott Harrison │ 150000 │ 2 │
│ Charles Juarez │ 190000 │ 3 │
│ Gary Chen │ 195000 │ 4 │
│ Robert George │ 195000 │ 5 │
└─────────────────┴────────┴─────┘
SELECT player, salary, 
row_number() OVER (ORDER BY salary) AS row,
rank() OVER (ORDER BY salary) AS rank,
dense_rank() OVER (ORDER BY salary) AS denseRank
FROM salaries;
┌─player──────────┬─salary─┬─row─┬─rank─┬─denseRank─┐
│ Michael Stanley │ 150000 │ 1 │ 1 │ 1 │
│ Scott Harrison │ 150000 │ 2 │ 1 │ 1 │
│ Charles Juarez │ 190000 │ 3 │ 3 │ 2 │
│ Gary Chen │ 195000 │ 4 │ 4 │ 3 │
│ Robert George │ 195000 │ 5 │ 4 │ 3 │
└─────────────────┴────────┴─────┴──────┴───────────┘

聚合函数

将每个球员的薪水与他们所在球队的平均薪水进行比较。

SELECT player, salary, team,
avg(salary) OVER (PARTITION BY team) AS teamAvg,
salary - teamAvg AS diff
FROM salaries;
┌─player──────────┬─salary─┬─team──────────────────────┬─teamAvg─┬───diff─┐
│ Charles Juarez │ 190000 │ New Coreystad Archdukes │ 170000 │ 20000 │
│ Scott Harrison │ 150000 │ New Coreystad Archdukes │ 170000 │ -20000 │
│ Gary Chen │ 195000 │ Port Elizabeth Barbarians │ 180000 │ 15000 │
│ Michael Stanley │ 150000 │ Port Elizabeth Barbarians │ 180000 │ -30000 │
│ Robert George │ 195000 │ Port Elizabeth Barbarians │ 180000 │ 15000 │
└─────────────────┴────────┴───────────────────────────┴─────────┴────────┘

将每个球员的薪水与他们所在球队的最高薪水进行比较。

SELECT player, salary, team,
max(salary) OVER (PARTITION BY team) AS teamAvg,
salary - teamAvg AS diff
FROM salaries;
┌─player──────────┬─salary─┬─team──────────────────────┬─teamAvg─┬───diff─┐
│ Charles Juarez │ 190000 │ New Coreystad Archdukes │ 190000 │ 0 │
│ Scott Harrison │ 150000 │ New Coreystad Archdukes │ 190000 │ -40000 │
│ Gary Chen │ 195000 │ Port Elizabeth Barbarians │ 195000 │ 0 │
│ Michael Stanley │ 150000 │ Port Elizabeth Barbarians │ 195000 │ -45000 │
│ Robert George │ 195000 │ Port Elizabeth Barbarians │ 195000 │ 0 │
└─────────────────┴────────┴───────────────────────────┴─────────┴────────┘

按列分区

CREATE TABLE wf_partition
(
`part_key` UInt64,
`value` UInt64,
`order` UInt64
)
ENGINE = Memory;

INSERT INTO wf_partition FORMAT Values
(1,1,1), (1,2,2), (1,3,3), (2,0,0), (3,0,0);

SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key) AS frame_values
FROM wf_partition
ORDER BY
part_key ASC,
value ASC;

┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1,2,3]<
122[1,2,3] │ │ 1-st group
133[1,2,3]<
200[0]<- 2-nd group
300[0]<- 3-d group
└──────────┴───────┴───────┴──────────────┘

框架边界

CREATE TABLE wf_frame
(
`part_key` UInt64,
`value` UInt64,
`order` UInt64
)
ENGINE = Memory;

INSERT INTO wf_frame FORMAT Values
(1,1,1), (1,2,2), (1,3,3), (1,4,4), (1,5,5);
-- Frame is bounded by bounds of a partition (BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
SELECT
part_key,
value,
order,
groupArray(value) OVER (
PARTITION BY part_key
ORDER BY order ASC
Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;

┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1,2,3,4,5]
122[1,2,3,4,5]
133[1,2,3,4,5]
144[1,2,3,4,5]
155[1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- short form - no bound expression, no order by
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1,2,3,4,5]
122[1,2,3,4,5]
133[1,2,3,4,5]
144[1,2,3,4,5]
155[1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- frame is bounded by the beginning of a partition and the current row
SELECT
part_key,
value,
order,
groupArray(value) OVER (
PARTITION BY part_key
ORDER BY order ASC
Rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;

┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1]
122[1,2]
133[1,2,3]
144[1,2,3,4]
155[1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- short form (frame is bounded by the beginning of a partition and the current row)
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order ASC) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1]
122[1,2]
133[1,2,3]
144[1,2,3,4]
155[1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- frame is bounded by the beginning of a partition and the current row, but order is backward
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order DESC) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[5,4,3,2,1]
122[5,4,3,2]
133[5,4,3]
144[5,4]
155[5]
└──────────┴───────┴───────┴──────────────┘
-- sliding frame - 1 PRECEDING ROW AND CURRENT ROW
SELECT
part_key,
value,
order,
groupArray(value) OVER (
PARTITION BY part_key
ORDER BY order ASC
Rows BETWEEN 1 PRECEDING AND CURRENT ROW
) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;

┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1]
122[1,2]
133[2,3]
144[3,4]
155[4,5]
└──────────┴───────┴───────┴──────────────┘
-- sliding frame - Rows BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING 
SELECT
part_key,
value,
order,
groupArray(value) OVER (
PARTITION BY part_key
ORDER BY order ASC
Rows BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
111[1,2,3,4,5]
122[1,2,3,4,5]
133[2,3,4,5]
144[3,4,5]
155[4,5]
└──────────┴───────┴───────┴──────────────┘
-- row_number does not respect the frame, so rn_1 = rn_2 = rn_3 != rn_4
SELECT
part_key,
value,
order,
groupArray(value) OVER w1 AS frame_values,
row_number() OVER w1 AS rn_1,
sum(1) OVER w1 AS rn_2,
row_number() OVER w2 AS rn_3,
sum(1) OVER w2 AS rn_4
FROM wf_frame
WINDOW
w1 AS (PARTITION BY part_key ORDER BY order DESC),
w2 AS (
PARTITION BY part_key
ORDER BY order DESC
Rows BETWEEN 1 PRECEDING AND CURRENT ROW
)
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┬─rn_1─┬─rn_2─┬─rn_3─┬─rn_4─┐
111[5,4,3,2,1]5552
122[5,4,3,2]4442
133[5,4,3]3332
144[5,4]2222
155[5]1111
└──────────┴───────┴───────┴──────────────┴──────┴──────┴──────┴──────┘
-- first_value and last_value respect the frame
SELECT
groupArray(value) OVER w1 AS frame_values_1,
first_value(value) OVER w1 AS first_value_1,
last_value(value) OVER w1 AS last_value_1,
groupArray(value) OVER w2 AS frame_values_2,
first_value(value) OVER w2 AS first_value_2,
last_value(value) OVER w2 AS last_value_2
FROM wf_frame
WINDOW
w1 AS (PARTITION BY part_key ORDER BY order ASC),
w2 AS (PARTITION BY part_key ORDER BY order ASC Rows BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC;
┌─frame_values_1─┬─first_value_1─┬─last_value_1─┬─frame_values_2─┬─first_value_2─┬─last_value_2─┐
[1]11[1]11
[1,2]12[1,2]12
[1,2,3]13[2,3]23
[1,2,3,4]14[3,4]34
[1,2,3,4,5]15[4,5]45
└────────────────┴───────────────┴──────────────┴────────────────┴───────────────┴──────────────┘
-- second value within the frame
SELECT
groupArray(value) OVER w1 AS frame_values_1,
nth_value(value, 2) OVER w1 AS second_value
FROM wf_frame
WINDOW w1 AS (PARTITION BY part_key ORDER BY order ASC Rows BETWEEN 3 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC
┌─frame_values_1─┬─second_value─┐
[1]0
[1,2]2
[1,2,3]2
[1,2,3,4]2
[2,3,4,5]3
└────────────────┴──────────────┘
-- second value within the frame + Null for missing values
SELECT
groupArray(value) OVER w1 AS frame_values_1,
nth_value(toNullable(value), 2) OVER w1 AS second_value
FROM wf_frame
WINDOW w1 AS (PARTITION BY part_key ORDER BY order ASC Rows BETWEEN 3 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC
┌─frame_values_1─┬─second_value─┐
[1] │ ᴺᵁᴸᴸ │
[1,2]2
[1,2,3]2
[1,2,3,4]2
[2,3,4,5]3
└────────────────┴──────────────┘

现实世界示例

以下示例解决了常见的现实世界问题。

每个部门的最高/总薪水

CREATE TABLE employees
(
`department` String,
`employee_name` String,
`salary` Float
)
ENGINE = Memory;

INSERT INTO employees FORMAT Values
('Finance', 'Jonh', 200),
('Finance', 'Joan', 210),
('Finance', 'Jean', 505),
('IT', 'Tim', 200),
('IT', 'Anna', 300),
('IT', 'Elen', 500);
SELECT
department,
employee_name AS emp,
salary,
max_salary_per_dep,
total_salary_per_dep,
round((salary / total_salary_per_dep) * 100, 2) AS `share_per_dep(%)`
FROM
(
SELECT
department,
employee_name,
salary,
max(salary) OVER wndw AS max_salary_per_dep,
sum(salary) OVER wndw AS total_salary_per_dep
FROM employees
WINDOW wndw AS (
PARTITION BY department
rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY
department ASC,
employee_name ASC
);

┌─department─┬─emp──┬─salary─┬─max_salary_per_dep─┬─total_salary_per_dep─┬─share_per_dep(%)─┐
│ Finance │ Jean │ 50550591555.19
│ Finance │ Joan │ 21050591522.95
│ Finance │ Jonh │ 20050591521.86
│ IT │ Anna │ 300500100030
│ IT │ Elen │ 500500100050
│ IT │ Tim │ 200500100020
└────────────┴──────┴────────┴────────────────────┴──────────────────────┴──────────────────┘

累积和

CREATE TABLE warehouse
(
`item` String,
`ts` DateTime,
`value` Float
)
ENGINE = Memory

INSERT INTO warehouse VALUES
('sku38', '2020-01-01', 9),
('sku38', '2020-02-01', 1),
('sku38', '2020-03-01', -4),
('sku1', '2020-01-01', 1),
('sku1', '2020-02-01', 1),
('sku1', '2020-03-01', 1);
SELECT
item,
ts,
value,
sum(value) OVER (PARTITION BY item ORDER BY ts ASC) AS stock_balance
FROM warehouse
ORDER BY
item ASC,
ts ASC;

┌─item──┬──────────────────ts─┬─value─┬─stock_balance─┐
│ sku1 │ 2020-01-01 00:00:0011
│ sku1 │ 2020-02-01 00:00:0012
│ sku1 │ 2020-03-01 00:00:0013
│ sku38 │ 2020-01-01 00:00:0099
│ sku38 │ 2020-02-01 00:00:00110
│ sku38 │ 2020-03-01 00:00:00-46
└───────┴─────────────────────┴───────┴───────────────┘

移动/滑动平均值(每 3 行)

CREATE TABLE sensors
(
`metric` String,
`ts` DateTime,
`value` Float
)
ENGINE = Memory;

insert into sensors values('cpu_temp', '2020-01-01 00:00:00', 87),
('cpu_temp', '2020-01-01 00:00:01', 77),
('cpu_temp', '2020-01-01 00:00:02', 93),
('cpu_temp', '2020-01-01 00:00:03', 87),
('cpu_temp', '2020-01-01 00:00:04', 87),
('cpu_temp', '2020-01-01 00:00:05', 87),
('cpu_temp', '2020-01-01 00:00:06', 87),
('cpu_temp', '2020-01-01 00:00:07', 87);
SELECT
metric,
ts,
value,
avg(value) OVER (
PARTITION BY metric
ORDER BY ts ASC
Rows BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_temp
FROM sensors
ORDER BY
metric ASC,
ts ASC;

┌─metric───┬──────────────────ts─┬─value─┬───moving_avg_temp─┐
│ cpu_temp │ 2020-01-01 00:00:008787
│ cpu_temp │ 2020-01-01 00:00:017782
│ cpu_temp │ 2020-01-01 00:00:029385.66666666666667
│ cpu_temp │ 2020-01-01 00:00:038785.66666666666667
│ cpu_temp │ 2020-01-01 00:00:048789
│ cpu_temp │ 2020-01-01 00:00:058787
│ cpu_temp │ 2020-01-01 00:00:068787
│ cpu_temp │ 2020-01-01 00:00:078787
└──────────┴─────────────────────┴───────┴───────────────────┘

移动/滑动平均值(每 10 秒)

SELECT
metric,
ts,
value,
avg(value) OVER (PARTITION BY metric ORDER BY ts
Range BETWEEN 10 PRECEDING AND CURRENT ROW) AS moving_avg_10_seconds_temp
FROM sensors
ORDER BY
metric ASC,
ts ASC;

┌─metric───┬──────────────────ts─┬─value─┬─moving_avg_10_seconds_temp─┐
│ cpu_temp │ 2020-01-01 00:00:008787
│ cpu_temp │ 2020-01-01 00:01:107777
│ cpu_temp │ 2020-01-01 00:02:209393
│ cpu_temp │ 2020-01-01 00:03:308787
│ cpu_temp │ 2020-01-01 00:04:408787
│ cpu_temp │ 2020-01-01 00:05:508787
│ cpu_temp │ 2020-01-01 00:06:008787
│ cpu_temp │ 2020-01-01 00:07:108787
└──────────┴─────────────────────┴───────┴────────────────────────────┘

移动/滑动平均值(每 10 天)

温度以秒精度存储,但使用RangeORDER BY toDate(ts),我们形成了一个大小为 10 个单位的框架,由于toDate(ts),单位是天。

CREATE TABLE sensors
(
`metric` String,
`ts` DateTime,
`value` Float
)
ENGINE = Memory;

insert into sensors values('ambient_temp', '2020-01-01 00:00:00', 16),
('ambient_temp', '2020-01-01 12:00:00', 16),
('ambient_temp', '2020-01-02 11:00:00', 9),
('ambient_temp', '2020-01-02 12:00:00', 9),
('ambient_temp', '2020-02-01 10:00:00', 10),
('ambient_temp', '2020-02-01 12:00:00', 10),
('ambient_temp', '2020-02-10 12:00:00', 12),
('ambient_temp', '2020-02-10 13:00:00', 12),
('ambient_temp', '2020-02-20 12:00:01', 16),
('ambient_temp', '2020-03-01 12:00:00', 16),
('ambient_temp', '2020-03-01 12:00:00', 16),
('ambient_temp', '2020-03-01 12:00:00', 16);
SELECT
metric,
ts,
value,
round(avg(value) OVER (PARTITION BY metric ORDER BY toDate(ts)
Range BETWEEN 10 PRECEDING AND CURRENT ROW),2) AS moving_avg_10_days_temp
FROM sensors
ORDER BY
metric ASC,
ts ASC;

┌─metric───────┬──────────────────ts─┬─value─┬─moving_avg_10_days_temp─┐
│ ambient_temp │ 2020-01-01 00:00:001616
│ ambient_temp │ 2020-01-01 12:00:001616
│ ambient_temp │ 2020-01-02 11:00:00912.5
│ ambient_temp │ 2020-01-02 12:00:00912.5
│ ambient_temp │ 2020-02-01 10:00:001010
│ ambient_temp │ 2020-02-01 12:00:001010
│ ambient_temp │ 2020-02-10 12:00:001211
│ ambient_temp │ 2020-02-10 13:00:001211
│ ambient_temp │ 2020-02-20 12:00:011613.33
│ ambient_temp │ 2020-03-01 12:00:001616
│ ambient_temp │ 2020-03-01 12:00:001616
│ ambient_temp │ 2020-03-01 12:00:001616
└──────────────┴─────────────────────┴───────┴─────────────────────────┘

参考

GitHub 问题

窗口函数的初始支持路线图在这个问题中

与窗口函数相关的所有 GitHub 问题都有comp-window-functions标签。

测试

这些测试包含当前支持的语法示例

https://github.com/ClickHouse/ClickHouse/blob/master/tests/performance/window_functions.xml

https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/01591_window_functions.sql

PostgreSQL 文档

https://postgresql.ac.cn/docs/current/sql-select.html#SQL-WINDOW

https://postgresql.ac.cn/docs/devel/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

https://postgresql.ac.cn/docs/devel/functions-window.html

https://postgresql.ac.cn/docs/devel/tutorial-window.html

MySQL 文档

https://dev.mysqlserver.cn/doc/refman/8.0/en/window-function-descriptions.html

https://dev.mysqlserver.cn/doc/refman/8.0/en/window-functions-usage.html

https://dev.mysqlserver.cn/doc/refman/8.0/en/window-functions-frames.html