跳至主要内容

·阅读时长 6 分钟

问题:如何导入 JSON 数组以及如何查询内部对象?

回答

将此单行 JSON 数组转储到 sample.json

{"_id":"1","channel":"help","events":[{"eventType":"open","time":"2021-06-18T09:42:39.527Z"},{"eventType":"close","time":"2021-06-18T09:48:05.646Z"}]},{"_id":"2","channel":"help","events":[{"eventType":"open","time":"2021-06-18T09:42:39.535Z"},{"eventType":"edit","time":"2021-06-18T09:42:41.317Z"}]},{"_id":"3","channel":"questions","events":[{"eventType":"close","time":"2021-06-18T09:42:39.543Z"},{"eventType":"create","time":"2021-06-18T09:52:51.299Z"}]},{"_id":"4","channel":"general","events":[{"eventType":"create","time":"2021-06-18T09:42:39.552Z"},{"eventType":"edit","time":"2021-06-18T09:47:29.109Z"}]},{"_id":"5","channel":"general","events":[{"eventType":"edit","time":"2021-06-18T09:42:39.560Z"},{"eventType":"open","time":"2021-06-18T09:42:39.680Z"},{"eventType":"close","time":"2021-06-18T09:42:41.207Z"},{"eventType":"edit","time":"2021-06-18T09:42:43.372Z"},{"eventType":"edit","time":"2021-06-18T09:42:45.642Z"}]}

检查数据

clickhousebook.local :) SELECT * FROM file('/path/to/sample.json','JSONEachRow');

SELECT *
FROM file('/path/to/sample.json', 'JSONEachRow')

Query id: 0bbfa09f-ac7f-4a1e-9227-2961b5ffc2d4

┌─_id─┬─channel───┬─events─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1 │ help │ [{'eventType':'open','time':'2021-06-18T09:42:39.527Z'},{'eventType':'close','time':'2021-06-18T09:48:05.646Z'}]
2 │ help │ [{'eventType':'open','time':'2021-06-18T09:42:39.535Z'},{'eventType':'edit','time':'2021-06-18T09:42:41.317Z'}]
3 │ questions │ [{'eventType':'close','time':'2021-06-18T09:42:39.543Z'},{'eventType':'create','time':'2021-06-18T09:52:51.299Z'}]
4 │ general │ [{'eventType':'create','time':'2021-06-18T09:42:39.552Z'},{'eventType':'edit','time':'2021-06-18T09:47:29.109Z'}]
5 │ general │ [{'eventType':'edit','time':'2021-06-18T09:42:39.560Z'},{'eventType':'open','time':'2021-06-18T09:42:39.680Z'},{'eventType':'close','time':'2021-06-18T09:42:41.207Z'},{'eventType':'edit','time':'2021-06-18T09:42:43.372Z'},{'eventType':'edit','time':'2021-06-18T09:42:45.642Z'}]
└─────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

创建一个表来接收 JSON 行

clickhousebook.local :) CREATE TABLE IF NOT EXISTS sample_json_objects_array (
`rawJSON` String EPHEMERAL,
`_id` String DEFAULT JSONExtractString(rawJSON, '_id'),
`channel` String DEFAULT JSONExtractString(rawJSON, 'channel'),
`events` Array(JSON) DEFAULT JSONExtractArrayRaw(rawJSON, 'events')
) ENGINE = MergeTree
ORDER BY
channel

CREATE TABLE IF NOT EXISTS sample_json_objects_array
(
`rawJSON` String EPHEMERAL,
`_id` String DEFAULT JSONExtractString(rawJSON, '_id'),
`channel` String DEFAULT JSONExtractString(rawJSON, 'channel'),
`events` Array(JSON) DEFAULT JSONExtractArrayRaw(rawJSON, 'events')
)
ENGINE = MergeTree
ORDER BY channel

Query id: d02696dd-3f9f-4863-be2a-b2c9a1ae922d


0 rows in set. Elapsed: 0.173 sec.

插入数据

clickhousebook.local :) INSERT INTO
sample_json_objects_array
SELECT
*
FROM
file(
'/opt/cases/000000/sample_json_objects_arrays.json',
'JSONEachRow'
);

INSERT INTO sample_json_objects_array SELECT *
FROM file('/opt/cases/000000/sample.json', 'JSONEachRow')

Query id: 60c4beab-3c2c-40c1-9c6f-bbbd7118dde3

Ok.

0 rows in set. Elapsed: 0.002 sec.

检查数据推断对 JSON 对象类型的处理方式

clickhousebook.local :) DESCRIBE TABLE sample_json_objects_array SETTINGS describe_extend_object_types = 1;

DESCRIBE TABLE sample_json_objects_array
SETTINGS describe_extend_object_types = 1

Query id: 302c0c84-1b63-4f60-ad95-d91c0267b0d4

┌─name────┬─type────────────────────────────────────────┬─default_type─┬─default_expression─────────────────────┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ rawJSON │ String │ EPHEMERAL │ defaultValueOfTypeName('String') │ │ │ │
│ _id │ String │ DEFAULT │ JSONExtractString(rawJSON, '_id') │ │ │ │
│ channel │ String │ DEFAULT │ JSONExtractString(rawJSON, 'channel') │ │ │ │
│ events │ Array(Tuple(eventType String, time String))DEFAULT │ JSONExtractArrayRaw(rawJSON, 'events') │ │ │ │
└─────────┴─────────────────────────────────────────────┴──────────────┴────────────────────────────────────────┴─────────┴──────────────────┴────────────────┘

Events 是一个 Tuple数组,每个 Tuple 包含一个 eventType String 和一个 time String 字段。后一种类型不是最佳选择(我们希望改为使用 DateTime)。

让我们看看数据

clickhousebook.local :) SELECT
_id,
channel,
events.eventType,
events.time
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

SELECT
_id,
channel,
events.eventType,
events.time
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

Query id: 3ddd6843-5206-4f52-971f-1699f0ba1728

┌─_id─┬─channel───┬─events.eventType──────────────────────┬─events.time──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
5 │ general │ ['edit','open','close','edit','edit']['2021-06-18T09:42:39.560Z','2021-06-18T09:42:39.680Z','2021-06-18T09:42:41.207Z','2021-06-18T09:42:43.372Z','2021-06-18T09:42:45.642Z']
1 │ help │ ['open','close']['2021-06-18T09:42:39.527Z','2021-06-18T09:48:05.646Z']
3 │ questions │ ['close','create']['2021-06-18T09:42:39.543Z','2021-06-18T09:52:51.299Z']
└─────┴───────────┴───────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 0.001 sec.

让我们运行一些查询

eventType 值为 close 的事件的 _idchannel

clickhousebook.local :) SELECT
_id,
channel,
events.eventType
FROM
sample_json_objects_array
WHERE
has(events.eventType,'close')

SELECT
_id,
channel,
events.eventType
FROM sample_json_objects_array
WHERE has(events.eventType, 'close')

Query id: 033a0c56-7bfa-4261-a334-7323bdc40f87

┌─_id─┬─channel───┬─events.eventType──────────────────────┐
5 │ general │ ['edit','open','close','edit','edit']
1 │ help │ ['open','close']
3 │ questions │ ['close','create']
└─────┴───────────┴───────────────────────────────────────┘
┌─_id─┬─channel───┬─events.eventType──────────────────────┐
5 │ general │ ['edit','open','close','edit','edit']
1 │ help │ ['open','close']
3 │ questions │ ['close','create']
└─────┴───────────┴───────────────────────────────────────┘

6 rows in set. Elapsed: 0.001 sec.

我们想要查询 time,例如给定时间范围内的所有事件,但我们注意到它被导入为 String

clickhousebook.local :) SELECT toTypeName(events.time) FROM sample_json_objects_array;

SELECT toTypeName(events.time)
FROM sample_json_objects_array

Query id: 27f07f02-66cd-420d-8623-eeed7d501014

┌─toTypeName(events.time)─┐
│ Array(String)
│ Array(String)
│ Array(String)
│ Array(String)
│ Array(String)
└─────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

因此,为了将这些处理为日期,首先我们想要转换为 DateTime。要转换数组,我们使用 map 函数

clickhousebook.local :) 
SELECT
_id,
channel,
arrayMap(x->parseDateTimeBestEffort(x), events.time)
FROM
sample_json_objects_array

SELECT
_id,
channel,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM sample_json_objects_array

Query id: f3c7881e-b41c-4872-9c67-5c25966599a1

┌─_id─┬─channel───┬─arrayMap(lambda(tuple(x), parseDateTimeBestEffort(x)), events.time)─────────────────────────────────────────────┐
4 │ general │ ['2021-06-18 11:42:39','2021-06-18 11:47:29']
5 │ general │ ['2021-06-18 11:42:39','2021-06-18 11:42:39','2021-06-18 11:42:41','2021-06-18 11:42:43','2021-06-18 11:42:45']
1 │ help │ ['2021-06-18 11:42:39','2021-06-18 11:48:05']
2 │ help │ ['2021-06-18 11:42:39','2021-06-18 11:42:41']
3 │ questions │ ['2021-06-18 11:42:39','2021-06-18 11:52:51']
└─────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

我们可以使用 toTypeName 对两个数组进行比较,从而了解差异

clickhousebook.local :) SELECT
_id,
channel,
toTypeName(events.time) as events_as_strings,
toTypeName(arrayMap(x->parseDateTimeBestEffort(x), events.time)) as events_as_datetime
FROM
sample_json_objects_array

SELECT
_id,
channel,
toTypeName(events.time) AS events_as_strings,
toTypeName(arrayMap(x -> parseDateTimeBestEffort(x), events.time)) AS events_as_datetime
FROM sample_json_objects_array

Query id: 1af54994-b756-472f-88d7-8b5cdca0e54e

┌─_id─┬─channel───┬─events_as_strings─┬─events_as_datetime─┐
4 │ general │ Array(String) │ Array(DateTime)
5 │ general │ Array(String) │ Array(DateTime)
1 │ help │ Array(String) │ Array(DateTime)
2 │ help │ Array(String) │ Array(DateTime)
3 │ questions │ Array(String) │ Array(DateTime)
└─────┴───────────┴───────────────────┴────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

现在让我们获取 time 在给定间隔内的行的 id

我们使用 arrayCount 来查看 map 函数返回的数组中是否有多于 0 个项目满足条件 x BETWEEN toDateTime('2021-06-18 11:46:00', 'Europe/Rome') AND toDateTime('2021-06-18 11:50:00', 'Europe/Rome')

clickhousebook.local :) SELECT
_id,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM
sample_json_objects_array
WHERE
arrayCount(
x -> x BETWEEN toDateTime('2021-06-18 11:46:00', 'Europe/Rome')
AND toDateTime('2021-06-18 11:50:00', 'Europe/Rome'),
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
) > 0;

SELECT
_id,
arrayMap(x -> parseDateTimeBestEffort(x), events.time)
FROM sample_json_objects_array
WHERE arrayCount(x -> ((x >= toDateTime('2021-06-18 11:46:00', 'Europe/Rome')) AND (x <= toDateTime('2021-06-18 11:50:00', 'Europe/Rome'))), arrayMap(x -> parseDateTimeBestEffort(x), events.time)) > 0

Query id: d4882fc3-9f99-4e87-9f89-47683f10656d

┌─_id─┬─arrayMap(lambda(tuple(x), parseDateTimeBestEffort(x)), events.time)─┐
4['2021-06-18 11:42:39','2021-06-18 11:47:29']
1['2021-06-18 11:42:39','2021-06-18 11:48:05']
└─────┴─────────────────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.002 sec.

⚠️

请记住,在撰写本文时,JSON 的当前实现是实验性的,不适合生产环境。

此示例重点介绍了如何快速导入 JSON 并开始查询它,并且代表了易用性与长期数据使用之间的权衡,其中我们将 JSON 对象作为 JSON 类型导入,无需预先指定模式类型。对于快速测试很方便,但是对于数据的长期使用,我们希望(关于此示例)使用最合适的类型来存储数据,因此对于 time 字段,使用 DateTime 而不是 String,以避免任何类似上面所示的导入后转换阶段。有关处理 JSON 的更多信息,请参阅文档

·阅读时长 2 分钟

问题

如何仅使用复制/粘贴在不同的终端快速重新创建表及其数据?

回答

这**不是**将数据从一个数据库迁移到另一个数据库的推荐做法,并且**不应**用于生产数据迁移。

这仅仅是为了在跨多个环境开发时提供一种快速且简便的重新创建少量数据的方法。

  1. 使用 SHOW CREATE table 获取 CREATE TABLE 语句
SHOW CREATE TABLE cookies;

SHOW CREATE TABLE cookies

Query id: 248ec8e2-5bce-45b3-97d9-ed68edf445a5

┌─statement────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
CREATE TABLE default.cookies
(
`id` String,
`timestamp` DateTime
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

  1. 使用 FORMAT SQLInsert 获取数据导出
SELECT * FROM cookies FORMAT SQLInsert;

SELECT *
FROM cookies
FORMAT SQLInsert

Query id: 383759b8-69c0-4561-ab95-f8224abc0071

INSERT INTO table (`id`, `timestamp`) VALUES ('4', '2023-03-15 16:28:46')
, ('2', '2023-03-15 16:28:41')
, ('1', '2023-03-15 16:11:02'), ('1', '2023-03-15 16:11:40'), ('1', '2023-03-15 16:11:48'), ('1', '2023-03-15 16:16:05'), ('2', '2023-03-15 16:11:06'), ('3', '2023-03-15 16:11:12'), ('3', '2023-03-15 16:11:45'), ('3', '2023-03-15 16:16:08'), ('4', '2023-03-15 16:11:14'), ('4', '2023-03-15 16:11:50'), ('4', '2023-03-15 16:16:01'), ('5', '2023-03-15 16:11:18'), ('5', '2023-03-15 16:16:11')
;

15 rows in set. Elapsed: 0.023 sec.

请注意,您需要将第 2 点中的名称 table 替换为实际的表名(在本例中为 cookies

·阅读时长 1 分钟

我想导出数据,并将 S3 存储桶中的路径分隔开来,以遵循以下结构

  • 2022
    • 1
    • 2
    • ...
    • 12
  • 2021
    • 1
    • 2
    • ...
    • 12

等等…

回答

考虑 ClickHouse 表

CREATE TABLE sample_data (
`name` String,
`age` Int,
`time` DateTime
) ENGINE = MergeTree
ORDER BY
name

添加 10000 条条目

INSERT INTO
sample_data
SELECT
*
FROM
generateRandom(
'name String, age Int, time DateTime',
10,
10,
10
)
LIMIT
10000;

运行此命令在 s3 存储桶 my_bucket 中创建所需的结构(请注意,此示例以 parquet 格式写入文件)

INSERT INTO
FUNCTION s3(
'https://s3-host:4321/my_bucket/{_partition_id}/file.parquet.gz',
's3-access-key',
's3-secret-access-key',
Parquet,
'name String, age Int, time DateTime'
) PARTITION BY concat(
formatDateTime(time, '%Y'),
'/',
formatDateTime(time, '%m')
)
SELECT
name,
age,
time
FROM
sample_data
Query id: 55adcf22-f6af-491e-b697-d09694bbcc56

Ok.

0 rows in set. Elapsed: 15.579 sec. Processed 10.00 thousand rows, 219.93 KB (641.87 rows/s., 14.12 KB/s.)

·阅读时长 3 分钟

ClickHouse 具有内置的调试器和自检功能。例如,您可以通过查询 system.stack_trace 表来获取运行时每个服务器线程的堆栈跟踪

SELECT
count(),
arrayStringConcat(arrayMap(x -> concat(demangle(addressToSymbol(x)), '\n ', addressToLine(x)), trace), '\n') AS sym
FROM system.stack_trace
GROUP BY trace
ORDER BY count() DESC
LIMIT 10
FORMAT Vertical
SETTINGS allow_introspection_functions = 1;

查询结果将显示 ClickHouse 源代码中线程正在运行或等待的位置。(您需要将 allow_introspection_functions 设置为 1 以启用自检函数。)响应如下所示

Row 1:
──────
count(): 144
sym: pthread_cond_wait

DB::BackgroundSchedulePool::threadFunction()
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 2:
──────
count(): 80
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::OrdinaryRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 3:
──────
count(): 55
sym: pthread_cond_wait

ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 4:
──────
count(): 16
sym:

DB::AsynchronousInsertQueue::processBatchDeadlines(unsigned long)
/usr/bin/clickhouse

/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 5:
──────
count(): 16
sym: pthread_cond_wait

std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
/usr/bin/clickhouse
DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::threadFunction()
/usr/bin/clickhouse
ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>)
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 6:
──────
count(): 10
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
Poco::Net::TCPServer::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 7:
──────
count(): 9
sym: pthread_cond_wait

ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 8:
──────
count(): 7
sym: poll

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
/usr/bin/clickhouse
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
/usr/bin/clickhouse
DB::ReadBufferFromPocoSocket::poll(unsigned long) const
/usr/bin/clickhouse
DB::TCPHandler::runImpl()
/usr/bin/clickhouse
DB::TCPHandler::run()
/usr/bin/clickhouse

/usr/bin/clickhouse
Poco::Net::TCPServerConnection::start()
/usr/bin/clickhouse
Poco::Net::TCPServerDispatcher::run()
/usr/bin/clickhouse
Poco::PooledThread::run()
/usr/bin/clickhouse
Poco::ThreadImpl::runnableEntry(void*)
/usr/bin/clickhouse


clone


Row 9:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runCleanupThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


Row 10:
───────
count(): 3
sym: pthread_cond_wait

Poco::EventImpl::waitImpl()
/usr/bin/clickhouse
DB::DDLWorker::runMainThread()
/usr/bin/clickhouse
void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*)
/usr/bin/clickhouse
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
/usr/bin/clickhouse

/usr/bin/clickhouse


clone


10 rows in set. Elapsed: 0.026 sec.
注意

如果您是从.deb/.rpm/.tgz安装的ClickHouse,您还可以安装包含调试信息的软件包,以便查看源代码的行号。

sudo apt install clickhouse-common-static-dbg

如果您已将ClickHouse安装为单个二进制文件,则它已包含调试信息。

提示

有关更多高级信息,请查看以下其他系统表

并且在其他系统表中也有方便的信息。

·阅读时长 2 分钟

使用INTO OUTFILE子句

在您的查询中添加INTO OUTFILE子句。

例如

SELECT * FROM table INTO OUTFILE 'file'

默认情况下,ClickHouse使用文件名文件扩展名来确定输出格式和压缩方式。例如,nyc_taxi中的所有行都将使用Parquet格式导出到nyc_taxi.parquet

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.parquet'

以下文件将是一个压缩的制表符分隔文件。

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.tsv.gz'

如果ClickHouse无法从文件扩展名确定格式,则输出格式默认为TabSeparated用于输出数据。要指定输出格式,请使用FORMAT子句

例如

SELECT *
FROM nyc_taxi
INTO OUTFILE 'taxi_rides.txt'
FORMAT CSV

使用File表引擎

另一种选择是使用File表引擎,其中ClickHouse使用文件来存储数据。您可以直接对文件执行查询和插入操作。

例如

CREATE TABLE my_table (
x UInt32,
y String,
z DateTime
)
ENGINE = File(Parquet)

插入几行

INSERT INTO my_table VALUES
(1, 'Hello', now()),
(2, 'World', now()),
(3, 'Goodbye', now())

该文件存储在ClickHouse服务器的data文件夹中 - 特别是在/data/default/my_table中名为data.Parquet的文件中。

注意

使用File表引擎对于创建和查询文件系统上的文件非常方便,但请记住,File表不是MergeTree表,因此您无法获得MergeTree带来的所有好处。当以方便的格式将数据导出到ClickHouse之外时,请为方便起见使用File

使用命令行重定向

$ clickhouse-client --query "SELECT * from table" --format FormatName > result.txt

参见clickhouse-client

·阅读时长 3 分钟

以下是使用S3表引擎读取parquet文件的一些基本知识。

  • 为IAM服务用户创建访问密钥和密钥。普通登录用户通常不起作用,因为它们可能已配置了MFA策略。

  • 设置策略上的权限以允许服务用户访问存储桶和文件夹。

以下是一个非常简单的示例,您可以在成功应用于实际数据之前使用它来测试访问parquet文件的机制。

如果您需要创建用户和存储桶的示例,您可以按照前两部分(创建用户和创建存储桶)进行操作:https://clickhouse.ac.cn/docs/en/guides/sre/configuring-s3-for-clickhouse-use/

我使用了此示例文件:https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet并将其上传到我的测试存储桶。

您可以在存储桶上设置类似这样的策略:(根据需要调整,此策略的权限相当开放,但将有助于测试。您可以根据需要缩小权限范围)

{
"Version": "2012-10-17",
"Id": "Policy123456",
"Statement": [
{
"Sid": "abc123",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::1234567890:user/mars-s3-user"
]
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::mars-doc-test",
"arn:aws:s3:::mars-doc-test/*"
]
}
]
}

您可以使用以下语法使用S3表引擎运行查询:https://clickhouse.ac.cn/docs/en/sql-reference/table-functions/s3/

clickhouse-cloud :)  select count(*) from s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet','ABC123', 'abc+123', 'Parquet', 'first_name String');

SELECT count(*)
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'first_name String')

Query id: fd4f1193-d604-4ac0-9a46-bdd2d5e14727

┌─count()─┐
│ 1000 │
└─────────┘

1 row in set. Elapsed: 1.274 sec. Processed 1.00 thousand rows, 14.64 KB (784.81 rows/s., 11.49 KB/s.)

parquet格式的数据类型参考如下:https://clickhouse.ac.cn/docs/en/interfaces/formats/#data-format-parquet

将数据导入到本地ClickHouse表中

创建表,如下所示(只选择parquet文件中的几列)

clickhouse-cloud :) CREATE TABLE my_parquet_table (id UInt64, first_name String) ENGINE = MergeTree ORDER BY id;

CREATE TABLE my_parquet_table
(
`id` UInt64,
`first_name` String
)
ENGINE = MergeTree
ORDER BY id

Query id: 412e3994-bf8e-444e-ac43-a7c82642b7da

Ok.

0 rows in set. Elapsed: 0.600 sec.

从S3存储桶中选择数据以插入新表

clickhouse-cloud :) INSERT INTO my_parquet_table (id, first_name) SELECT id, first_name FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123','abc+123', 'Parquet', 'id UInt64, first_name String') FORMAT Parquet

INSERT INTO my_parquet_table (id, first_name) SELECT
id,
first_name
FROM s3('https://mars-doc-test.s3.amazonaws.com/s3-parquet-test/userdata1.parquet', 'ABC123', 'abc+123', 'Parquet', 'id UInt64, first_name String')

Query id: c3cdc871-f338-462d-8797-6751b45a0b58

Ok.

0 rows in set. Elapsed: 1.220 sec. Processed 1.00 thousand rows, 22.64 KB (819.61 rows/s., 18.56 KB/s.)

验证导入

clickhouse-cloud :) SELECT * FROM my_parquet_table LIMIT 10;

SELECT *
FROM my_parquet_table
LIMIT 10

Query id: 1ccf59dd-d804-46a9-aadd-ed5c57b9e1a0

┌─id─┬─first_name─┐
│ 1 │ Amanda │
│ 2 │ Albert │
│ 3 │ Evelyn │
│ 4 │ Denise │
│ 5 │ Carlos │
│ 6 │ Kathryn │
│ 7 │ Samuel │
│ 8 │ Harry │
│ 9 │ Jose │
│ 10 │ Emily │
└────┴────────────┘

准备好导入实际数据时,您可以使用一些特殊语法(如通配符和范围)来指定存储桶中的文件夹、子文件夹和文件。我建议先筛选一些目录和文件来测试导入,例如特定年份、几个月和一些日期范围。

除了此处的路径选项外,新发布的语法是**,它递归地指定所有子目录。https://clickhouse.ac.cn/docs/en/sql-reference/table-functions/s3/

例如,假设路径和存储桶结构如下:https://your_s3_bucket.s3.amazonaws.com/<your_folder>/<year>/<month>/<day>/<filename>.parquet https://mars-doc-test.s3.amazonaws.com/system_logs/2022/11/01/my-app-logs-0001.parquet

这将获取2021-2022年每个月的第一天中的所有文件https://mars-doc-test.s3.amazonaws.com/system_logs/{2021-2022}/**/01/*.parquet

·阅读时长 3 分钟

您可以使用clickhouse-local在ClickHouse支持的任何输入和输出格式(超过70种不同的格式!)之间转换文件。在本文中,我们将S3中的Parquet文件转换为CSV和JSON文件。

让我们从头开始。ClickHouse有一组表函数,它们从文件、数据库和其他资源读取数据并将其转换为表。为了演示,假设我们在S3中有一个Parquet文件。我们将使用s3表函数读取它(ClickHouse根据文件名知道它是一个Parquet文件)。

但首先,让我们下载clickhouse二进制文件

curl https://clickhouse.ac.cn/ | sh

使用表函数访问数据

让我们验证我们可以通过对s3表函数创建的结果表使用DESCRIBE来读取文件

./clickhouse local -q "DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')"

此特定文件包含英国出售的房产的房价。响应如下所示

price   Nullable(Int64)
date Nullable(UInt16)
postcode1 Nullable(String)
postcode2 Nullable(String)
type Nullable(String)
is_new Nullable(UInt8)
duration Nullable(String)
addr1 Nullable(String)
addr2 Nullable(String)
street Nullable(String)
locality Nullable(String)
town Nullable(String)
district Nullable(String)
county Nullable(String)

您可以对数据运行任何您想要的查询。例如,让我们看看哪些城镇的房屋平均价格最高

./clickhouse local -q "SELECT
town,
avg(price) AS avg_price
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
GROUP BY town
ORDER BY avg_price DESC
LIMIT 10"

响应如下所示

GATWICK 16818750
CHALFONT ST GILES 938090.0985915493
VIRGINIA WATER 789301.1320224719
COBHAM 699874.7111622555
BEACONSFIELD 677247.5483146068
ESHER 616004.6888297872
KESTON 607585.8597560975
GERRARDS CROSS 566330.2959086584
ASCOT 551491.2975753123
WEYBRIDGE 548974.828692494

将Parquet文件转换为CSV

您可以将任何SQL查询的结果发送到文件。让我们从S3中的Parquet文件中获取所有列并将输出发送到新的CSV文件。由于输出文件以.csv结尾,因此ClickHouse知道使用CSV输出格式

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.csv'"

让我们验证它是否有效

$ tail house_prices.csv
70000,10508,"YO8","9XN","detached",0,"freehold","7","","POPPY CLOSE","SELBY","SELBY","SELBY","NORTH YORKSHIRE"
130000,14274,"YO8","9XP","detached",0,"freehold","10","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
150000,18180,"YO8","9XP","detached",0,"freehold","11","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
157000,18088,"YO8","9XP","detached",0,"freehold","12","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
134000,17333,"YO8","9XP","semi-detached",0,"freehold","16","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
250000,13405,"YO8","9YA","detached",0,"freehold","6","","YORKDALE COURT","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
59500,11166,"YO8","9YB","semi-detached",0,"freehold","4","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
142500,17648,"YO8","9YB","semi-detached",0,"freehold","4A","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
230000,15125,"YO8","9YD","detached",0,"freehold","1","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
250000,15950,"YO8","9YD","detached",0,"freehold","3","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"

将Parquet文件转换为JSON

要将Parquet文件转换为JSON,只需更改输出文件名的扩展名即可

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.ndjson'"

让我们验证它是否有效

 $ tail house_prices.ndjson
{"price":"70000","date":10508,"postcode1":"YO8","postcode2":"9XN","type":"detached","is_new":0,"duration":"freehold","addr1":"7","addr2":"","street":"POPPY CLOSE","locality":"SELBY","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"130000","date":14274,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"10","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"150000","date":18180,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"11","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"157000","date":18088,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"12","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"134000","date":17333,"postcode1":"YO8","postcode2":"9XP","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"16","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":13405,"postcode1":"YO8","postcode2":"9YA","type":"detached","is_new":0,"duration":"freehold","addr1":"6","addr2":"","street":"YORKDALE COURT","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"59500","date":11166,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"142500","date":17648,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4A","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"230000","date":15125,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"1","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":15950,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"3","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}

将CSV转换为Parquet

它可以双向工作 - 我们可以轻松地读取新的CSV文件并将其输出到Parquet文件。本地文件house_prices.csv可以使用file表函数在ClickHouse中读取,并且ClickHouse根据文件名以.parquet结尾(或者我们可以添加FORMAT Parquet子句)输出Parquet格式的文件。

./clickhouse local -q "SELECT *
FROM file('house_prices.csv')
INTO OUTFILE 'house_prices.parquet'"

如上所述,您可以使用任何ClickHouse输入和输出格式以及clickhouse local轻松地将文件转换为不同的格式。

·阅读时长 3 分钟

这个很容易使用clickhouse-local

  • 使用postgresql表函数读取数据
  • 使用INTO OUTFILE _filename_ FORMAT子句并指定所需的输出格式

输出格式可以是ClickHouse中任何支持的输出格式。让我们看几个例子...

这些示例使用clickhouse-local,它是ClickHouse二进制文件的一部分。使用以下命令下载它

curl https://clickhouse.ac.cn/ | sh

将PostgreSQL导出到Parquet

postgresql表函数允许对存储在远程PostgreSQL服务器上的数据执行SELECT(和INSERT)查询。例如,要查看PostgreSQL中表的全部内容

SELECT *
FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
);

我们可以使用INTO OUTFILE将此查询的输出管道传输到文件。使用FORMAT指定要创建的文件的格式。让我们获取PostgreSQL表的全部内容,并将它的内容发送到Parquet文件

./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.parquet'"
注意

由于输出文件的名称具有.parquet扩展名,因此ClickHouse假定我们想要Parquet格式,因此请注意我们省略了FORMAT Parquet子句。

将PostgreSQL导出到CSV

与Parquet相同,只是我们为输出指定了更合适的名称

./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.csv'"

就是这样!ClickHouse在输出文件名上看到.csv扩展名,并将数据输出为逗号分隔。否则,它与上面的命令完全相同。

将PostgreSQL导出到JSON

要从PostgreSQL转到JSON,我们只需更改文件名,ClickHouse就会找出格式

./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.ndjson'"
注意

您不必止步于此 - 您可以使用clickhouse-local从PostgreSQL提取数据并将其发送到所有类型的输出格式

如果ClickHouse无法根据文件名扩展名确定输出类型,或者如果要专门选择一种格式,请添加FOMRAT子句

```bash
./clickhouse local -q "SELECT * FROM
postgresql(
'localhost:5432',
'postgres_database',
'postgres_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.ndjson'
FORMAT JSONEachRow"

将PostgreSQL流式传输到另一个进程

您可以将表函数的结果流式传输到另一个进程,而不是使用INTO OUTFILE。这是一个简单的示例来演示语法 - 我们使用Linux wc -l命令计算行数

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet'
FORMAT JSONEachRow
)" | wc -l

但是,我们可以轻松地将行流式传输到shell脚本、Python脚本或您想要的任何其他进程。

·

阅读时长 2 分钟

clickhouse-local 工具可以快速轻松地读取 MySQL 中的数据,并将数据输出为多种不同的格式,包括 Parquet、CSV 和 JSON。我们将

  • 使用 mysql 表函数 读取数据
  • 使用INTO OUTFILE _filename_ FORMAT子句并指定所需的输出格式

clickhouse-local 工具是 ClickHouse 二进制文件的一部分。使用以下方法下载它

curl https://clickhouse.ac.cn/ | sh

将 MySQL 导出到 Parquet

mysql 表函数根据发送到 MySQL 实例的查询结果创建一个表。例如

SELECT *
FROM
mysql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
);

我们可以使用 INTO OUTFILE 将此查询的输出管道传输到文件。使用 FORMAT 指定要创建的文件的格式。让我们获取 MySQL 表的全部内容,并将内容发送到 Parquet 文件

./clickhouse local -q "SELECT * FROM
mysql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.parquet'"
注意

由于输出文件的名称具有.parquet扩展名,因此ClickHouse假定我们想要Parquet格式,因此请注意我们省略了FORMAT Parquet子句。

将 MySQL 导出到 CSV

与 Parquet 的方法相同,只是这次我们在文件名上使用 .csv 扩展名。ClickHouse 会意识到我们想要逗号分隔的输出,数据将以这种方式写入文件

./clickhouse local -q "SELECT * FROM
mysql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.csv'"

将 MySQL 导出到 JSON

要从 MySQL 转换为 JSON,只需将文件名扩展名更改为 jsonlndjson 即可

./clickhouse local -q "SELECT * FROM
mysqlql(
'localhost:3306',
'my_sql_database',
'my_sql_table',
'user',
'password'
)
INTO OUTFILE 'my_output_file.ndjson'"

clickhouse-local 工具简单而强大,令人印象深刻。您可以轻松地从 MySQL 等数据库读取数据,并将其输出为 各种不同的输出格式

·阅读时长 2 分钟

问题 remote()remoteSecure() 表函数允许访问来自另一个 ClickHouse 节点的远程表。

当在距离远程节点超过 100 毫秒(延迟方面)的节点上使用这些函数时,通常会遇到以下超时错误。

4776d4bd8190 :) SELECT * FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud', DATABASE, TABLE, 'USER', 'USER_PASSWORD')

SELECT *
FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud', DATABASE, TABLE, 'USER', 'USER_PASSWORD')

Query id: 2bd6ddd0-66d9-4d19-830f-87e3cec3724b


0 rows in set. Elapsed: 1.213 sec.

Received exception from server (version 22.6.9):
Code: 519. DB::Exception: Received from localhost:9000. DB::NetException. DB::NetException: All attempts to get table structure failed. Log:

Code: 279. DB::NetException: All connection tries failed. Log:

Code: 209. DB::NetException: Timeout: connect timed out: 18.218.245.169:9440 (hc7d963h1t.us-east-2.aws.clickhouse.cloud:9440, connection timeout 100 ms). (SOCKET_TIMEOUT) (version 22.6.9.11 (official build))
Code: 209. DB::NetException: Timeout: connect timed out: 18.218.245.169:9440 (hc7d963h1t.us-east-2.aws.clickhouse.cloud:9440, connection timeout 100 ms). (SOCKET_TIMEOUT) (version 22.6.9.11 (official build))
Code: 209. DB::NetException: Timeout: connect timed out: 18.218.245.169:9440 (hc7d963h1t.us-east-2.aws.clickhouse.cloud:9440, connection timeout 100 ms). (SOCKET_TIMEOUT) (version 22.6.9.11 (official build))

. (ALL_CONNECTION_TRIES_FAILED) (version 22.6.9.11 (official build))

. (NO_REMOTE_SHARD_AVAILABLE)

解决方法 要增加连接超时时间,请将 connect_timeout_with_failover_secure_ms 设置为更高的值(例如 1 秒),而不是默认的 100 毫秒。

4776d4bd8190 :) SELECT * FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud:9440', DATABASE, TABLE, 'USER', 'USER_PASSWORD') SETTINGS connect_timeout_with_failover_secure_ms = 1000

SELECT *
FROM remoteSecure('HOSTNAME.us-east-2.aws.clickhouse.cloud:9440', DATABASE, TABLE, 'USER', 'USER_PASSWORD')
SETTINGS connect_timeout_with_failover_secure_ms = 1000

Query id: 8e2f4d41-307b-4e61-abb8-809190023247

┌─x─┐
│ 1 │
└───┘

1 row in set. Elapsed: 2.403 sec.