博客 / 工程

使用 ClickHouse 机器学习函数进行预测

Ensemble

Oct 26, 2023 - 13 分钟阅读

这最初是 ensemble analytics 发布的一篇文章，他们慷慨地允许重新发布此内容。我们欢迎来自社区的文章，并感谢他们的贡献。

在进行统计分析或数据科学工作时，人们通常倾向于尽早转向 Python 或 R 等编程语言。

然而，当我们使用 ClickHouse 时，我们更倾向于尽可能只使用数据库来完成任务。这样做，我们可以依靠 ClickHouse 的强大功能来快速处理数据，并减少甚至完全避免我们需要编写的代码量。这也意味着我们可以在客户端处理更小的内存数据集，并避免对分布式计算的需求。

预测就是一个很好的例子。 ClickHouse 实现了两个机器学习函数 - 随机线性回归 (stochasticLinearRegression)，可用于拟合模型，以及一个函数 (evalMLMethod)，可用于直接在数据库中进行后续推理。

当然，一旦你跳出 SQL 进入成熟的编程语言，就会有更复杂的预测模型和更大的灵活性，但这种技术肯定有它的用途，并且在我们这里的演示场景中表现良好。

为了演示，我们将使用一个简单的航班起飞数据集，其中包含按月统计的不同机场和不同航空公司出发的乘客人数的时间序列。

我们的目标是获取这些数据，并使用它来预测未来的相同数据。

我们的目标是使用 2008 年至 2015 年的数据构建模型，然后测试该模型在 2015 年至 2018 年期间的性能。最后，我们将预测到 2021 年以后的时期。

我们的源数据具有以下结构

SELECT *
FROM flight_data
LIMIT 10

┌─AIRLINE─┬─DEPARTURE_AIRPORT─┬──────MONTH─┬─PASSENGERS─┐
│ Delta   │ DIA               │ 2008-01-01 │        434 │
│ Delta   │ DIA               │ 2008-02-01 │        475 │
│ Delta   │ DIA               │ 2008-03-01 │        531 │
│ Delta   │ DIA               │ 2008-04-01 │        509 │
│ Delta   │ DIA               │ 2008-05-01 │        472 │
│ Delta   │ DIA               │ 2008-06-01 │        562 │
│ Delta   │ DIA               │ 2008-07-01 │        642 │
│ Delta   │ DIA               │ 2008-08-01 │        642 │
│ Delta   │ DIA               │ 2008-09-01 │        596 │
│ Delta   │ DIA               │ 2008-10-01 │        503 │
└─────────┴───────────────────┴────────────┴────────────┘

10 rows in set. Elapsed: 0.002 sec. Processed 4.62 thousand rows, 151.54 KB (2.16 million rows/s., 70.86 MB/s.)
Peak memory usage: 229.15 KiB.

绘制后，数据看起来像这样，显示了随着时间的推移，所有航空公司的乘客数量都在增加，并且具有显着的季节性效应。

我们的预测模型使用 13 个确定性特征：线性时间趋势和代表一年中 12 个月的 12 个虚拟（或独热编码）变量。我们排除了常数项（或截距），以避免“虚拟变量陷阱”。

该模型预测乘客人数的对数。对数变换使我们能够更好地捕捉季节性波动的时变幅度。

CREATE VIEW
    data
AS WITH
    (select toDate(min(MONTH)) from flight_data) as start_date,
    (select toDate(max(MONTH)) from flight_data) as end_date
SELECT
    AIRLINE,
    DEPARTURE_AIRPORT,
    MONTH,
    toFloat64(log(PASSENGERS)) as Target,
    assumeNotNull(dateDiff('month', start_date, MONTH) / dateDiff('month', start_date, end_date)) as Trend,
    if(toMonth(toDate(MONTH)) = 1, 1, 0) as Dummy1,
    if(toMonth(toDate(MONTH)) = 2, 1, 0) as Dummy2,
    if(toMonth(toDate(MONTH)) = 3, 1, 0) as Dummy3,
    if(toMonth(toDate(MONTH)) = 4, 1, 0) as Dummy4,
    if(toMonth(toDate(MONTH)) = 5, 1, 0) as Dummy5,
    if(toMonth(toDate(MONTH)) = 6, 1, 0) as Dummy6,
    if(toMonth(toDate(MONTH)) = 7, 1, 0) as Dummy7,
    if(toMonth(toDate(MONTH)) = 8, 1, 0) as Dummy8,
    if(toMonth(toDate(MONTH)) = 9, 1, 0) as Dummy9,
    if(toMonth(toDate(MONTH)) = 10, 1, 0) as Dummy10,
    if(toMonth(toDate(MONTH)) = 11, 1, 0) as Dummy11,
    if(toMonth(toDate(MONTH)) = 12, 1, 0) as Dummy12
FROM
    flight_data
ORDER BY AIRLINE, DEPARTURE_AIRPORT, MONTH

这创建了以下视图，总结了我们的因变量和自变量

SELECT *
FROM data
LIMIT 10

┌─AIRLINE─┬─DEPARTURE_AIRPORT─┬──────MONTH─┬─────────────Target─┬────────────────Trend─┬─Dummy1─┬─Dummy2─┬─Dummy3─┬─Dummy4─┬─Dummy5─┬─Dummy6─┬─Dummy7─┬─Dummy8─┬─Dummy9─┬─Dummy10─┬─Dummy11─┬─Dummy12─┐
│ Delta   │ DIA               │ 2008-01-01 │ 6.0730445333335865 │                    0 │      1 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-02-01 │  6.163314804336003 │ 0.007633587786259542 │      0 │      1 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-03-01 │  6.274762021388925 │ 0.015267175572519083 │      0 │      0 │      1 │      0 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-04-01 │  6.232448016554782 │ 0.022900763358778626 │      0 │      0 │      0 │      1 │      0 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-05-01 │  6.156978985873825 │ 0.030534351145038167 │      0 │      0 │      0 │      0 │      1 │      0 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-06-01 │ 6.3315018500618665 │  0.03816793893129771 │      0 │      0 │      0 │      0 │      0 │      1 │      0 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-07-01 │  6.464588304624293 │  0.04580152671755725 │      0 │      0 │      0 │      0 │      0 │      0 │      1 │      0 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-08-01 │  6.464588304624293 │  0.05343511450381679 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      1 │      0 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-09-01 │  6.390240666362644 │ 0.061068702290076333 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      1 │       0 │       0 │       0 │
│ Delta   │ DIA               │ 2008-10-01 │  6.220590170138575 │  0.06870229007633588 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │      0 │       1 │       0 │       0 │
└─────────┴───────────────────┴────────────┴────────────────────┴──────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴─────────┘

10 rows in set. Elapsed: 0.010 sec. Processed 13.86 thousand rows, 170.02 KB (1.37 million rows/s., 16.81 MB/s.)
Peak memory usage: 420.28 KiB.

我们使用 ClickHouse 的 stochasticLinearRegression 算法，该算法使用梯度下降训练线性回归。我们同时构建 35 个不同的模型，每个航空公司-机场组合一个模型。

CREATE VIEW model as SELECT
    AIRLINE,
    DEPARTURE_AIRPORT,
    stochasticLinearRegressionState(0.5, 0.01, 4, 'SGD')(
        Target, Trend, Dummy1, Dummy2, Dummy3, Dummy4, Dummy5, Dummy6, Dummy7, Dummy8, Dummy9, Dummy10, Dummy11, Dummy12
    ) as state
FROM train_data
GROUP BY AIRLINE, DEPARTURE_AIRPORT

由于数据量较小，该模型仅定义为视图。对于更大的数据集，我们可以选择将其物化为表或视图。

现在我们可以使用训练好的模型来生成测试集上的预测，并将它们与实际值进行比较。在这个阶段，我们还可以通过取指数将数据和预测结果转换回原始尺度。

SELECT
    a.MONTH as MONTH,
    a.AIRLINE as AIRLINE,
    a.DEPARTURE_AIRPORT as DEPARTURE_AIRPORT,
    toInt32(exp(a.Target)) as ACTUAL,
    toInt32(exp(evalMLMethod(b.state, Trend, Dummy1, Dummy2, Dummy3, Dummy4, Dummy5, Dummy6, Dummy7,
    Dummy8, Dummy9, Dummy10, Dummy11, Dummy12))) as FORECAST
FROM test_data as a
LEFT JOIN model as b
on a.AIRLINE = b.AIRLINE and a.DEPARTURE_AIRPORT = b.DEPARTURE_AIRPORT

如果我们将预测值与实际值进行比较，我们可以看到预测效果良好

我们可以通过计算每个航空公司-机场组合的预测的平均绝对误差 (MAE) 和均方根误差 (RMSE) 来验证这一点。

SELECT
    AIRLINE,
    DEPARTURE_AIRPORT,
    avg(abs(ERROR)) AS MAE,
    sqrt(avg(pow(ERROR, 2))) AS RMSE
FROM
(
    SELECT
        a.AIRLINE AS AIRLINE,
        a.DEPARTURE_AIRPORT AS DEPARTURE_AIRPORT,
        toInt32(exp(a.Target)) - toInt32(exp(evalMLMethod(b.state, Trend, Dummy1, Dummy2, Dummy3, Dummy4,
        Dummy5, Dummy6, Dummy7, Dummy8, Dummy9, Dummy10, Dummy11, Dummy12))) AS ERROR
    FROM test_data AS a
    LEFT JOIN model AS b ON (a.AIRLINE = b.AIRLINE) AND (a.DEPARTURE_AIRPORT = b.DEPARTURE_AIRPORT)
)
GROUP BY
    AIRLINE,
    DEPARTURE_AIRPORT

Query id: 320cad46-bb31-4248-bd25-19d98d5d2d15

┌─AIRLINE──┬─DEPARTURE_AIRPORT─┬────────────────MAE─┬───────────────RMSE─┐
│ JetBlue  │ SFO               │  86.38888888888889 │ 110.96671172523367 │
│ KLM      │ PDX               │ 167.97222222222223 │  213.4134615143936 │
│ Delta    │ SJC               │ 141.80555555555554 │  180.9452802491528 │
│ United   │ PDX               │ 115.19444444444444 │  147.7711255812703 │
│ JetBlue  │ ORL               │  97.77777777777777 │ 125.28611699271038 │
│ KLM      │ JAX               │ 121.27777777777777 │ 155.41414207064798 │
│ Delta    │ JFK               │              168.5 │  214.1754213515433 │
│ United   │ JAX               │ 153.88888888888889 │  195.9098432102549 │
│ Delta    │ SFO               │ 184.66666666666666 │ 234.34068267280344 │
│ KLM      │ DIA               │ 148.94444444444446 │ 189.77618396416344 │
│ United   │ JFK               │ 178.02777777777777 │   226.086205289536 │
│ Frontier │ ORL               │ 206.38888888888889 │ 261.27720485679146 │
│ United   │ SJC               │ 119.91666666666667 │ 153.72332650288018 │
│ KLM      │ SJC               │ 218.13888888888889 │ 275.90532796595284 │
│ KLM      │ JFK               │  70.30555555555556 │  90.43244869944515 │
│ Delta    │ JAX               │ 186.55555555555554 │ 236.69213477990067 │
│ Delta    │ ORL               │  74.44444444444444 │  95.50887102486577 │
│ Frontier │ SFO               │  63.02777777777778 │  80.91748197323548 │
│ Frontier │ PDX               │                 81 │ 103.99278821149089 │
│ United   │ ORL               │              111.5 │ 142.90031490518138 │
│ Frontier │ JAX               │  98.11111111111111 │ 125.86147588166568 │
│ Frontier │ DIA               │  95.91666666666667 │ 122.96758832219886 │
│ Delta    │ PDX               │  72.41666666666667 │  92.89046715830904 │
│ JetBlue  │ JFK               │ 141.91666666666666 │ 181.17877911057906 │
│ JetBlue  │ SJC               │              209.5 │  265.1057441013973 │
│ JetBlue  │ JAX               │ 107.30555555555556 │ 137.61893845769274 │
│ KLM      │ ORL               │ 156.77777777777777 │ 199.51287900506296 │
│ JetBlue  │ DIA               │  76.83333333333333 │  98.60076628054729 │
│ Frontier │ SJC               │  97.22222222222223 │  124.6602048236191 │
│ Frontier │ JFK               │ 156.33333333333334 │ 199.04550010264265 │
│ Delta    │ DIA               │                114 │  146.3065655092454 │
│ KLM      │ SFO               │ 119.97222222222223 │  153.7722883573847 │
│ United   │ DIA               │  72.63888888888889 │  93.25666493905706 │
│ JetBlue  │ PDX               │ 147.83333333333334 │  188.4872527372725 │
│ United   │ SFO               │ 186.83333333333334 │ 237.06668072740865 │
└──────────┴───────────────────┴────────────────────┴────────────────────┘

35 rows in set. Elapsed: 0.024 sec. Processed 18.48 thousand rows, 321.55 KB (785.99 thousand rows/s., 13.68 MB/s.)
Peak memory usage: 766.46 KiB.

最后，我们现在可以使用该模型生成数据集中最后一个日期之后的预测。为此，我们创建了一个新表，其中包含未来 3 年的日期及其相应的转换（时间趋势和虚拟变量）。

CREATE VIEW
    future_data
AS WITH
    (select toDate(min(MONTH)) from flight_data) as start_date,
    (select toDate(max(MONTH)) from flight_data) as end_date
SELECT
    AIRLINE,
    DEPARTURE_AIRPORT,
    MONTH + INTERVAL 3 YEAR as MONTH,
    assumeNotNull(dateDiff('month', start_date, MONTH) / dateDiff('month', start_date, end_date)) as Trend,
    if(toMonth(toDate(MONTH)) = 1, 1, 0) as Dummy1,
    if(toMonth(toDate(MONTH)) = 2, 1, 0) as Dummy2,
    if(toMonth(toDate(MONTH)) = 3, 1, 0) as Dummy3,
    if(toMonth(toDate(MONTH)) = 4, 1, 0) as Dummy4,
    if(toMonth(toDate(MONTH)) = 5, 1, 0) as Dummy5,
    if(toMonth(toDate(MONTH)) = 6, 1, 0) as Dummy6,
    if(toMonth(toDate(MONTH)) = 7, 1, 0) as Dummy7,
    if(toMonth(toDate(MONTH)) = 8, 1, 0) as Dummy8,
    if(toMonth(toDate(MONTH)) = 9, 1, 0) as Dummy9,
    if(toMonth(toDate(MONTH)) = 10, 1, 0) as Dummy10,
    if(toMonth(toDate(MONTH)) = 11, 1, 0) as Dummy11,
    if(toMonth(toDate(MONTH)) = 12, 1, 0) as Dummy12
FROM
    test_data
ORDER BY AIRLINE, DEPARTURE_AIRPORT, MONTH