Atlas Summary
The metadata management framework Apache Atlas
In-depth analysis
Official documentation: http://atlas.apache.org/2.1.0/index.html#/
Atlas Developer Guide (Chinese edition): https://mantoudev.com/mantouBook/Atlas_cn/
Introduction
https://www.cnblogs.com/mantoudev/p/9986408.html
Atlas data model (important to understand; custom models are defined later)
The Atlas Glossary
https://www.cnblogs.com/mantoudev/p/9965869.html
Glossary-related operation APIs in atlas-webapp
webapp/src/main/java/org/apache/atlas/web/rest/GlossaryREST.java
The Atlas metadata model: the Type System
https://www.cnblogs.com/mantoudev/p/9985600.html
https://cloud.tencent.com/developer/article/1503998
Atlas allows users to define a model for the metadata objects they want to manage.
The model is composed of definitions called types. Instances of types, called entities, represent the actual metadata objects under management.
The Type System is the component that lets users define and manage types and entities.
All metadata objects managed by Atlas out of the box (for example, Hive tables) are modeled with types and represented as entities.
To store new kinds of metadata in Atlas, you need to understand the concepts of the Type System component.
Atlas metadata and index storage
Atlas uses JanusGraph to store and manage metadata.
By default, Atlas uses a standalone HBase instance as JanusGraph's underlying storage.
Atlas indexes metadata through JanusGraph to support full-text search queries.
To provide HA for the index store, it is recommended to configure Atlas to use Solr or Elasticsearch as JanusGraph's index backend.
Graph database engine: JanusGraph
Graph storage backend: HBase / Cassandra
Graph index backend: Solr / Elasticsearch
Analysis of the source code module structure
https://www.cnblogs.com/wang3680/p/13968277.html
Hands-on notes from others
Atlas 2.1.0 in practice (1): Building Atlas
https://cloud.tencent.com/developer/article/1764110
Atlas 2.1.0 in practice (2): Installing Atlas
https://cloud.tencent.com/developer/article/1768539?from=article.detail.1764110
Apache Atlas 1.2.0 deployment manual (using the cluster's existing HBase and Elasticsearch components instead of the embedded HBase and Solr)
https://blog.csdn.net/xueyao0201/article/details/94310199
Atlas 2.1.0 in practice (3): Integrating Atlas with Hive
https://cloud.tencent.com/developer/article/1781542
Atlas 2.1.0 in practice (4): Access control
https://cloud.tencent.com/developer/article/1785134
Meta-model
Overview
Atlas organizes all metadata objects with a Type/Entity model; the relationship between them is analogous to Class/Instance in OOP.
Types fall into several metatypes: Enum / Collection (Array, Map) / Composite (Entity, Struct, Classification, Relationship).
A Composite type can have multiple attributes, and an attribute can point to a metatype, which allows rich relationships to be modeled.
Interestingly, Entity and Classification types support inheritance.
What actually holds the metadata is an Entity, for example a Hive table.
Source code analysis
The main concepts are:
Type
Entity
Attributes
AtlasType: intg/src/main/java/org/apache/atlas/type/AtlasType.java
TypeCategory:intg/src/main/java/org/apache/atlas/model/TypeCategory.java
PRIMITIVE, OBJECT_ID_TYPE, ARRAY, MAP, ENUM, STRUCT,
CLASSIFICATION, ENTITY, RELATIONSHIP, BUSINESS_METADATA
SuperTypes:
Asset, DataSet, Process, Referenceable
intg/src/main/java/org/apache/atlas/model/typedef/AtlasBaseTypeDef.java (abstract base type definition)
-----intg/src/main/java/org/apache/atlas/model/typedef/AtlasEnumDef.java (enum type)
---------- private List<AtlasEnumElementDef> elementDefs;
-----intg/src/main/java/org/apache/atlas/model/typedef/AtlasStructDef.java (struct type)
---------- private List<AtlasAttributeDef> attributeDefs;
----------intg/src/main/java/org/apache/atlas/model/typedef/AtlasClassificationDef.java (classification type)
----------intg/src/main/java/org/apache/atlas/model/typedef/AtlasRelationshipDef.java (relationship type)
---------- intg/src/main/java/org/apache/atlas/model/typedef/AtlasRelationshipEndDef.java
----------intg/src/main/java/org/apache/atlas/model/typedef/AtlasEntityDef.java (entity type)
---------- private List<AtlasRelationshipAttributeDef> relationshipAttributeDefs;
---------- private Map<String, List<AtlasAttributeDef>> businessAttributeDefs;
----------intg/src/main/java/org/apache/atlas/model/typedef/AtlasBusinessMetadataDef.java (business metadata type)
intg/src/main/java/org/apache/atlas/model/typedef/AtlasTypesDef.java (container for all type definitions)
private List<AtlasEnumDef> enumDefs;
private List<AtlasStructDef> structDefs;
private List<AtlasClassificationDef> classificationDefs;
private List<AtlasEntityDef> entityDefs;
private List<AtlasRelationshipDef> relationshipDefs;
private List<AtlasBusinessMetadataDef> businessMetadataDefs;
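To make the relationships between the classes listed above concrete, here is a minimal, hypothetical Java sketch: it defines a custom entity type and wraps it in an AtlasTypesDef ready to be registered via atlasClientV2.createAtlasTypeDefs(...). The type name "my_dataset" and its attributes are invented for illustration; the helper methods follow the style used in QuickStartV2.
import java.util.Collections;

import org.apache.atlas.model.typedef.AtlasEntityDef;
import org.apache.atlas.model.typedef.AtlasTypesDef;
import org.apache.atlas.type.AtlasTypeUtil;

public class CustomTypeDefSketch {
    public static AtlasTypesDef buildTypesDef() {
        // Hypothetical entity type "my_dataset" extending the built-in DataSet super type
        AtlasEntityDef myDatasetDef = AtlasTypeUtil.createClassTypeDef(
                "my_dataset", "demo entity type", "1.0",
                Collections.singleton("DataSet"),
                AtlasTypeUtil.createOptionalAttrDef("owner", "string"),
                AtlasTypeUtil.createOptionalAttrDef("retentionDays", "int"));

        // AtlasTypesDef bundles enum/struct/classification/entity/relationship defs into one payload
        AtlasTypesDef typesDef = new AtlasTypesDef();
        typesDef.setEntityDefs(Collections.singletonList(myDatasetDef));
        return typesDef;
    }
}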
Example analysis
Hive
addons/hive-bridge/src/main/java/org/apache/atlas/hive/model/HiveDataTypes.java
addons/models/1000-Hadoop/1030-hive_model.json
docs/src/documents/Hook/HookHive.md
Kafka
addons/kafka-bridge/src/main/java/org/apache/atlas/kafka/model/KafkaDataTypes.java
addons/models/1000-Hadoop/1070-kafka_model.json
docs/src/documents/Hook/HookKafka.md
Sqoop
addons/sqoop-bridge/src/main/java/org/apache/atlas/sqoop/model/SqoopDataTypes.java
addons/models/1000-Hadoop/1040-sqoop_model.json
docs/src/documents/Hook/HookSqoop.md
How to define and extend a custom model
https://atlas.apache.org/2.1.0/index.html#/TypeSystem
https://www.cnblogs.com/163yun/p/9015985.html
https://www.cnblogs.com/mantoudev/p/9985600.html
https://blog.csdn.net/rlnLo2pNEfx9c/article/details/106846113
Metadata integration: offline import of demo data
Main goals of this example
(1) Show how to create custom metadata offline, including the type system and concrete entities
(2) Show how to query metadata entities and their lineage
Entry point
python bin/quick_start.py
Python source
distro/src/bin/quick_start.py
Java main class
webapp/src/main/java/org/apache/atlas/examples/QuickStartV2.java
It mainly builds the metadata and then sends it to the Atlas server via the REST API
REST address:
######### Server Properties #########
atlas.rest.address=http://localhost:21000
Java client utility class:
client/client-v2/src/main/java/org/apache/atlas/AtlasClientV2.java
Analysis of the main logic
Create the type system
// Shows how to create v2 types in Atlas for your meta model
quickStartV2.createTypes();
Core logic:
AtlasTypesDef atlasTypesDef = createTypeDefinitions();
atlasClientV2.createAtlasTypeDefs(atlasTypesDef);
Create entities (instances) of those types
// Shows how to create v2 entities (instances) for the added types in Atlas
quickStartV2.createEntities();
Core logic:
Build AtlasEntity instances of the various types
intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java
Then call AtlasClientV2 to send the request (a sketch follows below)
EntityMutationResponse response = atlasClientV2.createEntity(entityWithExtInfo);
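A rough, self-contained sketch of that flow is shown below; it assumes a default local install (http://localhost:21000, admin/admin) and creates an instance of the hypothetical my_dataset type from the earlier sketch, so the attribute values are illustrative only.
import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.EntityMutationResponse;

public class CreateEntitySketch {
    public static void main(String[] args) throws Exception {
        // Assumed server URL and credentials for a default local install
        AtlasClientV2 atlasClientV2 = new AtlasClientV2(
                new String[]{"http://localhost:21000"}, new String[]{"admin", "admin"});

        // Build an instance of the hypothetical "my_dataset" type
        AtlasEntity entity = new AtlasEntity("my_dataset");
        entity.setAttribute("name", "demo_table");
        entity.setAttribute("qualifiedName", "demo_table@primary");
        entity.setAttribute("owner", "suyc");

        // Send it to the Atlas server over REST
        EntityMutationResponse response =
                atlasClientV2.createEntity(new AtlasEntity.AtlasEntityWithExtInfo(entity));
        System.out.println(response.getCreatedEntities());
    }
}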
Demonstrate DSL queries
// Shows some search queries using DSL based on types
quickStartV2.search();
Core logic:
AtlasSearchResult results = atlasClientV2.dslSearchWithParams(dslQuery, 10, 0);
Demonstrate how to query an entity's lineage
// Shows some lineage information on entity
quickStartV2.lineage();
Core logic:
AtlasLineageInfo lineageInfo =
atlasClientV2.getLineageInfo(getTableId(SALES_FACT_DAILY_MV_TABLE), LineageDirection.BOTH, 0);
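A compact, hypothetical usage sketch of the DSL search and lineage calls above; the DSL query string and the way the GUID is picked are made up for illustration, and depth 0 follows QuickStartV2's usage.
import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.model.discovery.AtlasSearchResult;
import org.apache.atlas.model.instance.AtlasEntityHeader;
import org.apache.atlas.model.lineage.AtlasLineageInfo;
import org.apache.atlas.model.lineage.AtlasLineageInfo.LineageDirection;

public class SearchAndLineageSketch {
    public static void main(String[] args) throws Exception {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://localhost:21000"}, new String[]{"admin", "admin"});

        // DSL search: first 10 entities of type hive_table, offset 0
        AtlasSearchResult results = client.dslSearchWithParams("hive_table", 10, 0);

        // Take the GUID of the first hit and walk its lineage in both directions
        if (results.getEntities() != null && !results.getEntities().isEmpty()) {
            AtlasEntityHeader first = results.getEntities().get(0);
            AtlasLineageInfo lineageInfo = client.getLineageInfo(first.getGuid(), LineageDirection.BOTH, 0);
            System.out.println(lineageInfo.getRelations());
        }
    }
}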
Hive integration test
Configuring the Atlas Hive integration
Copy atlas-application.properties to the conf directory of the Hive client
Copy the Hive hook directories produced by the Atlas build to the Hive client
On my machine, the path after building the Atlas project is:
D:\workspace\idea\atlas\distro\target\apache-atlas-2.1.0-hive-hook\apache-atlas-hive-hook-2.1.0
Modify the hive-site.xml configuration
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
<property>
<name>hive.metastore.event.listeners</name>
<value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Add the hook jars to the classpath by modifying hive-env.sh
export HIVE_AUX_JARS_PATH=/usr/local/hive/hook/hive
Hive client environment
[root@hadoop01 ~]# which hive
/usr/local/hive/bin/hive
[root@hadoop01 ~]# cd /usr/local/hive/
[root@hadoop01 hive]# pwd
/usr/local/hive
[root@hadoop01 hive]# ll
总用量 11044
drwxr-xr-x 3 root root 179 8月 6 13:55 bin
drwxr-xr-x 2 root root 4096 9月 2 20:35 conf
drwxr-xr-x 4 root root 34 3月 24 16:10 examples
drwxr-xr-x 7 root root 68 3月 24 16:10 hcatalog
drwxr-xr-x 3 root root 18 9月 3 09:59 hook
drwxr-xr-x 2 root root 28 9月 3 09:59 hook-bin
-rw-r--r-- 1 root root 2040 9月 3 10:04 hook-bin.zip
-rw-r--r-- 1 root root 11251678 9月 3 10:04 hook.zip
[root@hadoop01 hive]# ll hook-bin
总用量 8
-rw-r--r-- 1 root root 4246 8月 19 10:38 import-hive.sh
[root@hadoop01 hive]# ll hook
总用量 0
drwxr-xr-x 3 root root 112 9月 3 09:59 hive
[root@hadoop01 hive]# ll hook/hive/
总用量 36
drwxr-xr-x 2 root root 4096 9月 3 09:59 atlas-hive-plugin-impl
-rw-r--r-- 1 root root 17506 9月 3 09:58 atlas-plugin-classloader-2.1.0.jar
-rw-r--r-- 1 root root 11563 9月 3 09:58 hive-bridge-shim-2.1.0.jar
[root@hadoop01 hive]# ll hook/hive/atlas-hive-plugin-impl/
总用量 12260
-rw-r--r-- 1 root root 37495 9月 3 09:51 atlas-client-common-2.1.0.jar
-rw-r--r-- 1 root root 42189 9月 3 09:51 atlas-client-v1-2.1.0.jar
-rw-r--r-- 1 root root 22362 9月 3 09:51 atlas-client-v2-2.1.0.jar
-rw-r--r-- 1 root root 79688 9月 3 09:51 atlas-common-2.1.0.jar
-rw-r--r-- 1 root root 559518 9月 3 09:51 atlas-intg-2.1.0.jar
-rw-r--r-- 1 root root 64144 9月 3 09:51 atlas-notification-2.1.0.jar
-rw-r--r-- 1 root root 362679 6月 30 14:04 commons-configuration-1.10.jar
-rw-r--r-- 1 root root 96551 9月 3 09:58 hive-bridge-2.1.0.jar
-rw-r--r-- 1 root root 66897 7月 29 14:41 jackson-annotations-2.9.9.jar
-rw-r--r-- 1 root root 325632 7月 29 14:41 jackson-core-2.9.9.jar
-rw-r--r-- 1 root root 1400944 6月 15 21:47 jackson-databind-2.10.0.jar
-rw-r--r-- 1 root root 165345 6月 15 22:02 jersey-json-1.19.jar
-rw-r--r-- 1 root root 53275 7月 29 14:40 jersey-multipart-1.19.jar
-rw-r--r-- 1 root root 45927 7月 29 19:18 jsr311-api-1.1.jar
-rw-r--r-- 1 root root 7295202 7月 29 14:40 kafka_2.11-2.0.0.jar
-rw-r--r-- 1 root root 1893564 7月 29 14:40 kafka-clients-2.0.0.jar
[root@hadoop01 hive]# cat conf/hive-env.sh | grep -i HIVE_AUX_JARS_PATH
export HIVE_AUX_JARS_PATH=/usr/local/hive/hook/hive
[root@hadoop01 hive]# cat conf/hive-site.xml | grep -C3 -i atlas
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
<property>
<name>hive.metastore.event.listeners</name>
<value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Offline import of Hive databases and tables
hook-bin/import-hive.sh
Real-time Hive hook: the Hive driver (client) side
Start the Hive client
[root@hadoop01 hive]# hive
hive> drop database db0903;
......
2021-09-03 10:40:49,917 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:compile(554)) - Compiling command(queryId=root_20210903104049_5a3a1ef1-f59b-45e6-a92c-76792707227b): drop database db0903
2021-09-03 10:40:49,933 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:<init>(177)) - 222==============================
2021-09-03 10:40:49,934 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:checkConcurrency(285)) - Concurrency mode is disabled, not creating a lock manager
2021-09-03 10:40:49,966 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:compile(666)) - Semantic Analysis Completed (retrial = false)
2021-09-03 10:40:49,966 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:getSchema(374)) - Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:compile(781)) - Completed compiling command(queryId=root_20210903104049_5a3a1ef1-f59b-45e6-a92c-76792707227b); Time taken: 0.05 seconds
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] reexec.ReExecDriver (ReExecDriver.java:run(156)) - Execution #1 of query
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:checkConcurrency(285)) - Concurrency mode is disabled, not creating a lock manager
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:execute(2255)) - Executing command(queryId=root_20210903104049_5a3a1ef1-f59b-45e6-a92c-76792707227b): drop database db0903
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:launchTask(2662)) - Starting task [Stage-0:DDL] in serial mode
2021-09-03 10:40:50,640 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:run(186)) - 222==============================queryStr:drop database db0903
2021-09-03 10:40:50,640 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:run(187)) - 222==============================LINKIS.SUBMIT.USER:null
2021-09-03 10:40:50,641 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:run(188)) - 222==============================LINKIS.TASK.NAME:null
[root@hadoop01 hive]# ps -ef | grep -i hive
root 21130 5116 2 10:30 pts/3 00:00:08 /usr/jdk1.8.0_191/bin/java -Dproc_jar -Djava.net.preferIPv4Stack=true -Dproc_hivecli -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/usr/local/hive/conf/parquet-logging.properties -Dyarn.log.dir=/usr/local/hadoop-3.2.1/logs -Dyarn.log.file=hadoop.log -Dyarn.home.dir=/usr/local/hadoop-3.2.1 -Dyarn.root.logger=INFO,console -Djava.library.path=/usr/local/hadoop-3.2.1/lib/native -Xmx256m -Dhadoop.log.dir=/usr/local/hadoop-3.2.1/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/hadoop-3.2.1 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/local/hive/lib/hive-cli-3.1.2.jar org.apache.hadoop.hive.cli.CliDriver --hiveconf hive.aux.jars.path=file:///home/usr_local/hive/hook/hive/atlas-plugin-classloader-2.1.0.jar,file:///home/usr_local/hive/hook/hive/hive-bridge-shim-2.1.0.jar
Real-time Hive hook: the HiveServer2 side
Real-time Hive hook: the Hive Metastore server side
Once configured, restart the metastore service; the startup log shows the relevant configuration files and Atlas classes being loaded
[root@hadoop01 hook]# hive --service metastore
......
2021-09-03 09:06:02,086 INFO [main] conf.MetastoreConf (MetastoreConf.java:findConfigFile(1240)) - Found configuration file file:/home/usr_local/hive/conf/hive-site.xml
2021-09-03 09:06:03,040 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:startupShutdownMessage(9236)) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting HiveMetaStore
STARTUP_MSG: host = hadoop01/172.24.2.232
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.2
STARTUP_MSG: classpath = /usr/local/hive/conf:/home/usr_local/hive/hook/hive/atlas-plugin-classloader-2.1.0.jar:/home/usr_local/hive/hook/hive/hive-bridge-shim-2.1.0.jar:/
......
2021-09-03 09:06:05,550 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:get(121)) - Looking for atlas-application.properties in classpath
2021-09-03 09:06:05,550 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:get(134)) - Loading atlas-application.properties from file:/home/usr_local/hive/conf/atlas-application.properties
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(314)) - Using graphdb backend 'janus'
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(325)) - Using storage backend 'hbase2'
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(336)) - Using index backend 'elasticsearch'
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(360)) - Setting atlas.graph.index.search.max-result-set-size = 150
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.db-cache = true
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.db-cache-clean-wait = 20
2021-09-03 09:06:05,569 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.db-cache-size = 0.5
2021-09-03 09:06:05,569 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.tx-cache-size = 15000
2021-09-03 09:06:05,569 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.tx-dirty-size = 120
2021-09-03 09:06:05,583 INFO [main] kafka.KafkaNotification (KafkaNotification.java:<init>(115)) - ==> KafkaNotification()
2021-09-03 09:06:05,592 INFO [main] kafka.KafkaNotification (KafkaNotification.java:<init>(149)) - <== KafkaNotification()
2021-09-03 09:06:05,624 INFO [main] hook.AtlasHook (AtlasHook.java:<clinit>(141)) - Created Atlas Hook
2021-09-03 09:06:05,628 INFO [main] hook.HiveHook (HiveHook.java:<init>(177)) - 222==============================
2021-09-03 09:06:05,679 INFO [main] conf.HiveConf (HiveConf.java:findConfigFile(187)) - Found configuration file file:/home/usr_local/hive/conf/hive-site.xml
Start the metastore service as a background daemon
nohup hive --service metastore >/usr/local/hive/metastore.log 2>&1 &
SQL test examples
SET LINKIS.SUBMIT.USER=suyc;
SET LINKIS.TASK.NAME=ws01-pj01-flow01;
create database db0903;
create table db0903.person01(
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
create table db0903.person02(
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
insert overwrite table db0903.person02 select * from db0903.person01;
create table db0903.person03 as select * from db0903.person02;
Customizing the Hive hook (secondary development)
Remote debugging of the Hive client
https://www.cnblogs.com/songchaolin/p/13084252.html
hive --debug starts the client with a remote-debug listener
Connect to it from local IDEA as a remote debugger
Requirement
When collecting Hive metadata, top-level business information needs to be attached so that business metadata can be linked to technical metadata
How the hook on the Hive driver (client) side works
Hook configuration added to hive-site.xml
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
Source code entry point in the Atlas project
addons/hive-bridge-shim/src/main/java/org/apache/atlas/hive/hook/HiveHook.java
Mechanism: Hive hook
Hive hook interface
org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
Atlas implementation
addons/hive-bridge-shim/src/main/java/org/apache/atlas/hive/hook/HiveHook.java
addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java
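For reference, a minimal post-execution hook of our own could look like the sketch below. It only illustrates the ExecuteWithHookContext contract and one way the business variables seen in the logs (LINKIS.SUBMIT.USER, LINKIS.TASK.NAME) might be read from the session configuration; the class name is hypothetical and this is not the Atlas implementation.
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical hook; would be registered via hive.exec.post.hooks alongside the Atlas HiveHook
public class BusinessInfoPostHook implements ExecuteWithHookContext {
    @Override
    public void run(HookContext hookContext) throws Exception {
        // SQL text of the statement that just finished executing
        String queryStr = hookContext.getQueryPlan().getQueryStr();

        // Session-level variables set by the caller, e.g. "SET LINKIS.SUBMIT.USER=suyc;"
        // (assumption: plain SET puts them into the session conf visible here)
        String submitUser = hookContext.getConf().get("LINKIS.SUBMIT.USER");
        String taskName   = hookContext.getConf().get("LINKIS.TASK.NAME");

        // Here the business info could be attached to the metadata notification sent to Atlas
        System.out.println("query=" + queryStr + ", submitUser=" + submitUser + ", taskName=" + taskName);
    }
}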
How the hook on the HiveServer2 side works
How the hook on the Hive Metastore server side works
Hook configuration added to hive-site.xml
<property>
<name>hive.metastore.event.listeners</name>
<value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Source code entry point in the Atlas project
addons/hive-bridge-shim/src/main/java/org/apache/atlas/hive/hook/HiveMetastoreHook.java
Mechanism: Hive listener
Hive abstract class
org.apache.hadoop.hive.metastore.MetaStoreEventListener
Atlas implementation
addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveMetastoreHookImpl.java
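By contrast, the metastore-side hook is an event listener. A bare-bones, hypothetical listener (not the Atlas implementation) that only logs table events would look like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.events.CreateTableEvent;
import org.apache.hadoop.hive.metastore.events.DropTableEvent;

// Hypothetical listener; would be registered via hive.metastore.event.listeners in hive-site.xml
public class LoggingMetastoreListener extends MetaStoreEventListener {
    public LoggingMetastoreListener(Configuration config) {
        super(config);
    }

    @Override
    public void onCreateTable(CreateTableEvent event) {
        System.out.println("table created: " + event.getTable().getTableName());
    }

    @Override
    public void onDropTable(DropTableEvent event) {
        System.out.println("table dropped: " + event.getTable().getTableName());
    }
}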
Concrete implementation
Spark 2.4.x integration test
Known issues
Spark versions supported by the SAC hook
Currently only Spark 2.4.x is supported; Spark 3.x is not
The lineage information collected by the SAC hook is incomplete
Hands-on testing confirmed the following conclusions:
Known Limitations (Design decision)
SAC only supports SQL/DataFrame API (in other words, SAC doesn't support RDD).
All "inputs" and "outputs" in multiple queries are accumulated into single "spark_process" entity when there're multple queries running in single Spark session.
SAC classifies table related entities with two different kind of models: Spark / Hive.
We decided to skip sending create events for Hive tables managed by HMS to avoid duplication of those events from Atlas hook for Hive . For Hive entities, Atlas relies on Atlas hook for Hive as the source of truth.
Spark DDL is not captured by the Spark hook; it has to be collected by enabling the Hive metastore hook
SAC
https://github.com/hortonworks-spark/spark-atlas-connector
Note: actual testing showed that SAC only supports Spark 2.4.x; Spark 3.x is not supported because of incompatible Java classes
Building SAC
[root@hadoop03 spark-atlas-connector]# pwd
/home/atlas/spark-atlas-connector
[root@hadoop03 spark-atlas-connector]# git status
# 位于分支 master
无文件要提交,干净的工作区
[root@hadoop03 spark-atlas-connector]# mvn clean
[root@hadoop03 spark-atlas-connector]# mvn package -DskipTests
To ignore errors when compiling the tests: mvn clean package -Dmaven.test.skip=true
......
[root@hadoop03 spark-atlas-connector]# ll spark-atlas-connector-assembly/target/
drwxr-xr-x 2 root root 28 8月 5 18:10 antrun
drwxr-xr-x 2 root root 28 8月 5 18:10 maven-archiver
-rw-r--r-- 1 root root 2803 8月 5 18:10 original-spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar
drwxr-xr-x 4 root root 41 8月 5 18:10 scala-2.11
-rw-r--r-- 1 root root 41679846 8月 5 18:10 spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar
drwxr-xr-x 2 root root 6 8月 5 18:10 tmp
Install a Spark gateway and test SAC
Atlas service deployment
The Atlas service (with the full ZK + Kafka + HBase + ES stack) has already been deployed on the hadoop04 machine
Base environment
[root@hadoop01 usr_local]# which java
/usr/jdk1.8.0_191/bin/java
[root@hadoop01 usr_local]# which hadoop
/usr/local/hadoop-3.2.1/bin/hadoop
[root@hadoop01 usr_local]# which hive
/usr/local/hive/bin/hive
[root@hadoop01 usr_local]# which spark
/usr/bin/which: no spark in (/usr/local/scala/bin:/usr/local/flink/bin:/usr/local/apache-maven-3.6.1/bin:/usr/jdk1.8.0_191/bin:/usr/jdk1.8.0_191/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/hadoop-3.2.1/bin:/usr/local/hive/bin:/usr/local/spark/bin:/usr/local/hadoop-3.2.1/etc/hadoop:/root/bin)
[root@hadoop01 usr_local]# echo $SPARK_HOME
/usr/local/spark
Preparing the Atlas files
[root@hadoop01 ~]# ll /home/atlas_files/
-rw-r--r-- 1 root root 12332 8月 6 13:41 atlas-application.properties
-rw-r--r-- 1 root root 41679846 8月 6 13:41 spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar
Deploying the Spark gateway and preparing its configuration files
[root@hadoop01 usr_local]# pwd
/home/usr_local
[root@hadoop01 usr_local]# unzip spark-2.4.7.zip
[root@hadoop01 spark-2.4.7]# pwd
/home/usr_local/spark-2.4.7
[root@hadoop01 spark-2.4.7]# cp /usr/local/hive/conf/hive-site.xml ./conf/
[root@hadoop01 spark-2.4.7]# ll conf/ | grep -i "atlas\|hive"
-rw-r--r-- 1 root root 12332 8月 6 13:35 atlas-application.properties
-rw-r--r-- 1 root root 2212 8月 6 13:37 hive-site.xml
[root@hadoop01 spark-2.4.7]# cat conf/spark-env.sh | grep -iv "#"
export JAVA_HOME=/usr/jdk1.8.0_191/
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-3.2.1/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop-3.2.1/conf
export HIVE_HOME=/usr/local/hive
[root@hadoop01 ~]# ln -s /home/usr_local/spark-2.4.7 /usr/local/spark2.4.7
[root@hadoop01 ~]# cd /usr/local/spark2.4.7/
[root@hadoop01 spark2.4.7]# pwd
/usr/local/spark2.4.7
----------- another Spark environment already exists, so we need to test our own version
[root@hadoop01 spark2.4.7]# echo $SPARK_HOME
/usr/local/spark
[root@hadoop01 spark2.4.7]# export SPARK_HOME=/usr/local/spark2.4.7
[root@hadoop01 spark2.4.7]# echo $SPARK_HOME
/usr/local/spark2.4.7
Testing
----------- start the Spark client without the hook
[root@hadoop01 spark2.4.7]# bin/spark-shell --master yarn
[root@hadoop01 spark2.4.7]# bin/spark-sql --master yarn
----------- start the Spark client with the hook
bin/spark-sql --master yarn --executor-memory 1G --executor-cores 1 \
--files /home/atlas_files/atlas-application.properties \
--jars /home/atlas_files/spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker
----------- ignored by the Spark hook
spark-sql> create database db_suyc;
----------- ignored by the Spark hook
CREATE TABLE db_suyc.person22(
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
----------- captured by the Spark hook and sent to Atlas in real time
spark-sql> create table db_suyc.person33 as select * from db_suyc.person22;
----------- captured by the Spark hook and sent to Atlas in real time
spark-sql> insert into default.demo_01 select * from default.demo_02;
Remote debugging
Problem
We want a dynamic, runtime-level understanding of the Atlas internals and source code;
Option 1: a local debugging environment
https://my.oschina.net/u/4286379/blog/4329390
Atlas is not well suited to running on Windows, and my workstation happens to be Windows 10, so a local debugging environment is hard to set up
Option 2: remote debugging
Enable remote-debug mode on the Atlas service
I modified bin/atlas_start.py, changing only one line:
#DEFAULT_JVM_OPTS="-Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server"
DEFAULT_JVM_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=9999 -Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server"
Remote debugging from local IDEA
https://www.cnblogs.com/wy2325/p/5600232.html
Make sure the remote host's IP and port are entered correctly
Then set breakpoints just as in local debugging
Debug points of interest
The REST APIs exposed by Atlas
Where to set breakpoints:
webapp/src/main/java/org/apache/atlas/web/rest
How to trigger:
Send test HTTP requests with Postman
The logic in which Atlas consumes hook messages from the ATLAS_HOOK Kafka topic
Where to set breakpoints:
webapp/src/main/java/org/apache/atlas/notification/NotificationHookConsumer.java
webapp/src/main/java/org/apache/atlas/notification/preprocessor/HivePreprocessor.java
How to trigger:
Run Hive SQL so that the Hive hook pushes a message to the Kafka topic;
or run Spark SQL so that the Spark hook pushes a message to the Kafka topic;
Hive hook execution logic
Remote debugging of the Hive client
https://www.cnblogs.com/songchaolin/p/13084252.html
Remote debugging of the HiveServer2 server and the Beeline client
https://blog.csdn.net/merrily01/article/details/105725414/
Spark hook execution logic
Remote debugging of the Spark client
https://blog.csdn.net/asfjgvajfghaklsbf/article/details/109671367
REST API summary and testing
Background
We can interact with Atlas through the REST API it exposes, to create, read, update, and delete metadata
When necessary, we can extend it to fit our own requirements
Reference documentation
Public REST API
Source
webapp/src/main/java/org/apache/atlas/web/rest
Official API documentation
http://atlas.apache.org/api/v2/
Swagger documentation
http://atlas.apache.org/api/v2/ui/index.html#/
Other references
https://www.jianshu.com/p/a37ae460986f
https://blog.csdn.net/wangpei1949/article/details/87891862
https://marcel-jan.eu/datablog/2019/09/03/the-atlas-rest-api-working-examples/
Public Java API
Source
client/client-v2/src/main/java/org/apache/atlas/AtlasClientV2.java
Essentially a wrapper around the REST API
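For example, the curl calls in the next section have one-line Java equivalents; a minimal sketch, reusing a GUID from the examples below and assuming admin/admin credentials:
import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.model.instance.AtlasEntity;

public class ClientV2Sketch {
    public static void main(String[] args) throws Exception {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://hadoop04:21000"}, new String[]{"admin", "admin"});

        // Java equivalent of GET /api/atlas/v2/entity/guid/{guid}
        AtlasEntity.AtlasEntityWithExtInfo info =
                client.getEntityByGuid("2dd4ca4c-9d33-4c19-bca3-f60e162debf2");
        System.out.println(info.getEntity().getTypeName() + " : " + info.getEntity().getAttribute("name"));
    }
}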
API summary and tests
AdminREST
Check the status of the Atlas Metadata Server node: GET /admin/status
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/admin/status"
Get the Atlas version and description: GET /admin/version
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/admin/version"
DiscoveryREST
# List all Hive databases
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?typeName=hive_db"
# List all Hive tables
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?typeName=hive_table"
# List all Hive tables matching a keyword
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?typeName=hive_table&query=ads_gmv_sum_day"
# List all Hive databases (v1 API)
http://hadoop04:21000/api/atlas/entities?type=hive_db
# List all Hive tables (v1 API)
http://hadoop04:21000/api/atlas/entities?type=hive_table
TypesREST
Retrieve all types with full details: GET /v2/types/typedefs
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/types/typedefs"
Retrieve all types with minimal information: GET /v2/types/typedefs/headers
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/types/typedefs/headers"
EntityREST
Look up a table's GUID
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?query=gdyinfo_new&typeName=hive_table"
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/uniqueAttribute/type/hive_table?attr:qualifiedName=default.demo_02@primary"
Retrieve entities in bulk by GUID: GET /v2/entity/bulk
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/bulk?minExtInfo=yes&guid=2dd4ca4c-9d33-4c19-bca3-f60e162debf2"
Get a single entity definition: GET /v2/entity/guid/{guid}
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/guid/2dd4ca4c-9d33-4c19-bca3-f60e162debf2"
Get an entity's tag (classification) list: GET /v2/entity/guid/{guid}/classifications
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/guid/2dd4ca4c-9d33-4c19-bca3-f60e162debf2/classifications"
Get an entity by a unique attribute value:
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/entities?type={type_name}&property={unique_attribute_name}&value={unique_attribute_value}"
Update a single attribute of an entity
PUT http://hadoop04:21000/api/atlas/v2/entity/guid/0e822d4c-a578-4b0a-b9e6-085096fbf92f?name=comment
Request body: "这是一个测试的表 by suyc"
LineageREST
Generating lineage data
Lineage can be generated by creating a Process entity through the Atlas REST API.
For example, to relate a MySQL table and a Hive table that Atlas already manages and produce lineage between them,
first look up the GUIDs of the two tables, then build the request payload and call
the endpoint: http://{atlas_host}:21000/api/atlas/v2/entity/bulk
Request body:
{
"entities":[
{
"typeName":"Process",
"attributes":{
"owner":"root",
"createTime":"2020-05-07T10:32:21.0Z",
"updateTime":"",
"qualifiedName":"people@process@mysql://192.168.1.1:3306",
"name":"peopleProcess",
"description":"people Process",
"comment":"test people Process",
"contact_info":"jdbc",
"type":"table",
"inputs":[
{
"guid":"5a676b74-e058-4e81-bcf8-42d73f4c1729",
"typeName":"rdbms_table"
}
],
"outputs":[
{
"guid":"2e7c70e1-5a8a-4430-859f-c46d267e33fd",
"typeName":"hive_table"
}
]
}
}
]
}
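The same Process-creation request can also be issued from Java; a hypothetical sketch using AtlasClientV2 that mirrors the JSON payload above, reusing its placeholder GUIDs and names:
import java.util.Collections;

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.AtlasObjectId;

public class CreateLineageProcessSketch {
    public static void main(String[] args) throws Exception {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://hadoop04:21000"}, new String[]{"admin", "admin"});

        // A Process entity linking an existing rdbms_table (input) to an existing hive_table (output)
        AtlasEntity process = new AtlasEntity("Process");
        process.setAttribute("qualifiedName", "people@process@mysql://192.168.1.1:3306");
        process.setAttribute("name", "peopleProcess");
        process.setAttribute("inputs", Collections.singletonList(
                new AtlasObjectId("5a676b74-e058-4e81-bcf8-42d73f4c1729", "rdbms_table")));
        process.setAttribute("outputs", Collections.singletonList(
                new AtlasObjectId("2e7c70e1-5a8a-4430-859f-c46d267e33fd", "hive_table")));

        client.createEntity(new AtlasEntity.AtlasEntityWithExtInfo(process));
    }
}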
Query an entity's lineage: GET /v2/lineage/{guid}
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/lineage/2dd4ca4c-9d33-4c19-bca3-f60e162debf2"
Integrating RDBMS metadata via the API
Create rdbms_instance, rdbms_db, rdbms_column, and rdbms_table entities
https://www.codeleading.com/article/29371584292/
Request method: POST
Request path: http://hadoop04:21000/api/atlas/v2/entity
Authentication: Basic Auth admin/admin
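A minimal, hypothetical sketch of the same POST through AtlasClientV2; the rdbms_db attribute set here is illustrative only, so check the RDBMS model JSON under addons/models for the attributes the type actually requires:
import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.model.instance.AtlasEntity;

public class CreateRdbmsDbSketch {
    public static void main(String[] args) throws Exception {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://hadoop04:21000"}, new String[]{"admin", "admin"});

        // Illustrative attribute values; required attributes depend on the rdbms model definition
        AtlasEntity db = new AtlasEntity("rdbms_db");
        db.setAttribute("name", "people_db");
        db.setAttribute("qualifiedName", "people_db@mysql_demo");

        client.createEntity(new AtlasEntity.AtlasEntityWithExtInfo(db));
    }
}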