使用 JupySQL 与 ClickHouse
在本指南中,我们将展示与 ClickHouse 的集成。
我们将使用 JupySQL 在 ClickHouse 之上运行查询。一旦数据加载完成,我们将通过 SQL 绘图将其可视化。
JupySQL 和 ClickHouse 之间的集成是通过使用 clickhouse_sqlalchemy 库实现的。该库允许两个系统之间轻松通信,并使用户能够连接到 ClickHouse 并传递 SQL 方言。连接后,用户可以直接从 Clickhouse 原生 UI 或直接从 Jupyter notebook 运行 SQL 查询。
# Install required packages
%pip install --quiet jupysql clickhouse_sqlalchemy
注意:您可能需要重启内核以使用更新的软件包。
import pandas as pd
from sklearn_evaluation import plot
# Import jupysql Jupyter extension to create SQL cells
%load_ext sql
%config SqlMagic.autocommit=False
您需要确保您的 Clickhouse 已启动并可访问,以便进行后续步骤。您可以使用本地版本或云版本。
注意: 您将需要根据您尝试连接的实例类型(url、用户、密码)调整连接字符串。在下面的示例中,我们使用了本地实例。要了解更多信息,请查看本指南。
%sql clickhouse://default:@localhost:8123/default
%%sql
CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
- clickhouse://default:***@localhost:8123/default 完成。
%%sql
INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames', "
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
") SETTINGS input_format_try_infer_datetimes = 0
- clickhouse://default:***@localhost:8123/default 完成。
%sql SELECT count() FROM trips limit 5;
- clickhouse://default:***@localhost:8123/default 完成。
count() |
---|
1999657 |
%sql SELECT DISTINCT(pickup_ntaname) FROM trips limit 5;
- clickhouse://default:***@localhost:8123/default 完成。
pickup_ntaname |
---|
晨边高地 |
哈德逊广场-切尔西-熨斗区-联合广场 |
中城-中城南 |
苏荷区-翠贝卡-市政中心-小意大利 |
莫瑞山-基普斯湾 |
%sql SELECT round(avg(tip_amount), 2) FROM trips
- clickhouse://default:***@localhost:8123/default 完成。
round(avg(tip_amount), 2) |
---|
1.68 |
%%sql
SELECT
passenger_count,
ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count
- clickhouse://default:***@localhost:8123/default 完成。
passenger_count | average_total_amount |
---|---|
0 | 22.69 |
1 | 15.97 |
2 | 17.15 |
3 | 16.76 |
4 | 17.33 |
5 | 16.35 |
6 | 16.04 |
7 | 59.8 |
8 | 36.41 |
9 | 9.81 |
%%sql
SELECT
pickup_date,
pickup_ntaname,
SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC
limit 5;
- clickhouse://default:***@localhost:8123/default 完成。
pickup_date | pickup_ntaname | number_of_trips |
---|---|---|
2015-07-01 | 布什维克北部 | 2 |
2015-07-01 | 布莱顿海滩 | 1 |
2015-07-01 | 布莱尔伍德-牙买加山 | 3 |
2015-07-01 | 威廉斯堡 | 1 |
2015-07-01 | 皇后桥-拉文斯伍德-长岛市 | 9 |
# %sql DESCRIBE trips;
# %sql SELECT DISTINCT(trip_distance) FROM trips limit 50;
%%sql --save short-trips --no-execute
SELECT *
FROM trips
WHERE trip_distance < 6.3
- clickhouse://default:***@localhost:8123/default 跳过执行...
%sqlplot histogram --table short-trips --column trip_distance --bins 10 --with short-trips
<AxesSubplot: title={'center': "'trip_distance' from 'short-trips'"}, xlabel='trip_distance', ylabel='Count'>
ax = %sqlplot histogram --table short-trips --column trip_distance --bins 50 --with short-trips
ax.grid()
ax.set_title("Trip distance from trips < 6.3")
_ = ax.set_xlabel("Trip distance")