
Hive

2023-07-20 15:40:49






Data warehouse, Hive



Outline / Content

Compression and storage

Compression

LZO

Snappy

Storage

Row-oriented storage

TextFile

SequenceFile

Columnar storage

ORC

Parquet

Functions

Common built-in functions

289 built-in functions (the exact number depends on the Hive version)

show functions;

desc function year;

Brief description

desc function extended year;

Detailed description with examples

Assigning a default to NULL fields

nvl(value,default_value)

select ename, sal,comm, sal+nvl(comm, 0) income from emp;
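The effect of `nvl` on the income query can be sketched in Python. This is only a model of the semantics, not Hive code: `None` stands in for SQL NULL, and the two `emp` rows are made-up sample data.

```python
def nvl(value, default_value):
    """Return default_value when value is NULL (None), otherwise value."""
    return default_value if value is None else value


# Hypothetical rows of (ename, sal, comm); None stands for SQL NULL.
emp = [("SMITH", 800.0, None), ("ALLEN", 1600.0, 300.0)]

# Mirrors: select ename, sal + nvl(comm, 0) income from emp;
income = [(ename, sal + nvl(comm, 0)) for ename, sal, comm in emp]
print(income)
```

Without `nvl`, adding a NULL `comm` to `sal` would make the whole sum NULL, which is why the default of 0 is substituted first.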

case

select deptno, sum(case sex when '男' then 1 else 0 end) man_sum, sum(case sex when '女' then 1 else 0 end) woman_sum from emp_sex group by deptno; -- '男' = male, '女' = female
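A rough Python model of the `sum(case ... end)` pattern above, assuming a few made-up `emp_sex` rows: each row contributes 1 or 0 to a per-department counter, which is what the conditional sum amounts to.

```python
from collections import defaultdict

# Hypothetical emp_sex rows of (name, deptno, sex).
emp_sex = [("a", 10, "男"), ("b", 10, "女"), ("c", 20, "男"), ("d", 10, "男")]

counts = defaultdict(lambda: {"man_sum": 0, "woman_sum": 0})
for _name, deptno, sex in emp_sex:
    # case sex when '男' then 1 else 0 end
    counts[deptno]["man_sum"] += 1 if sex == "男" else 0
    # case sex when '女' then 1 else 0 end
    counts[deptno]["woman_sum"] += 1 if sex == "女" else 0

result = dict(counts)
print(result)
```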

Rows to columns

concat('1','2','3')

'123'

concat_ws(',', '1','2','3')

'1,2,3'

Arguments after the separator: string | array<string>

collect_set()

Aggregates a column's values into an array with duplicates removed

select collect_set(deptno) from emp;

[10,20,30]
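The two building blocks of a rows-to-columns query can be modelled in a few lines of Python. This is a simplified sketch: NULL handling is not modelled, and the model keeps first-seen order, whereas the order of elements returned by Hive's `collect_set` should not be relied on.

```python
def concat_ws(sep, *args):
    """Join strings and arrays of strings with the separator sep."""
    parts = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            parts.extend(arg)   # array<string> argument
        else:
            parts.append(arg)   # plain string argument
    return sep.join(parts)


def collect_set(values):
    """Collapse many rows into one list, dropping duplicates."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


deptnos = collect_set([10, 20, 10, 30, 20])
joined = concat_ws(",", "1", "2", "3")
names = concat_ws("|", collect_set(["a", "b", "a"]))
print(deptnos, joined, names)
```

Combining the two, as in the last line, is the usual way to collapse a group of rows into a single delimited string.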

Columns to rows

explode(array | map)

array

produces one column

map

produces two columns (key and value)

split("str1", ",")

Return value: array

lateral view

select name, friend from person lateral view explode(friends) tmp as friend ;
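What `lateral view explode(...)` does to each row can be sketched in Python. The `person` rows below are invented sample data, and the function name is a hypothetical helper, not a Hive API.

```python
def lateral_view_explode(rows, array_col, alias):
    """Emit one output row per array element, keeping the other columns of the source row."""
    for row in rows:
        for element in row[array_col]:
            out = dict(row)
            out[alias] = element
            yield out


# Hypothetical person table; split() turns a delimited string into an array.
person = [
    {"name": "songsong", "friends": "bingbing,lili".split(",")},
    {"name": "yangyang", "friends": ["caicai"]},
]

# Mirrors: select name, friend from person lateral view explode(friends) tmp as friend;
result = [(r["name"], r["friend"]) for r in lateral_view_explode(person, "friends", "friend")]
print(result)
```

In this model, a row whose array is empty produces no output rows at all.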

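Window functions

over()

window

unbounded preceding

from the first row of the partition

unbounded following

through the last row of the partition

n preceding

n rows before the current row

n following

n rows after the current row

current row

the current row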

partition by

order by
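The frame keywords above can be illustrated with a small Python model of `sum(x) over (order by ... rows between A and B)` for a single partition. It assumes the values are already in window order; `frame_sum` is a hypothetical helper written for this sketch.

```python
def frame_sum(values, preceding, following):
    """Sum over `rows between <preceding> preceding and <following> following`.

    None means unbounded; 0 means the frame edge is the current row.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        out.append(sum(values[start:end]))
    return out


values = [10, 20, 30, 40]
running_total = frame_sum(values, None, 0)   # unbounded preceding .. current row
neighbours = frame_sum(values, 1, 1)         # 1 preceding .. 1 following
remaining = frame_sum(values, 0, None)       # current row .. unbounded following
print(running_total, neighbours, remaining)
```

With `partition by`, the same computation would simply be repeated independently for each partition.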

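rank()

Ties share a rank and the following ranks are skipped, so the last rank equals the row count

1, 2, 2, 4

dense_rank()

Ties share a rank and no ranks are skipped, so the last rank is smaller

1, 2, 2, 3

row_number()

No ties: every row gets a distinct number

1, 2, 3, 4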
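The difference between the three ranking functions is easiest to see side by side. The sketch below models them in Python for one partition whose values are already sorted in window order; the scores are made up.

```python
def rank_all(sorted_values):
    """Return (rank, dense_rank, row_number) lists for values already in window order."""
    rank, dense, row_num = [], [], []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            # A tie repeats the previous rank and dense rank.
            rank.append(rank[-1])
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                            # jumps past tied rows
            dense.append(dense[-1] + 1 if dense else 1)   # never skips a value
        row_num.append(i + 1)                             # always distinct
    return rank, dense, row_num


rank, dense, row_num = rank_all([100, 90, 90, 80])
print(rank, dense, row_num)
```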

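User-defined functions

Extend GenericUDF and implement its methods

public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // 1. Check the number of input arguments
    // 2. Check the types of the input arguments
    // 3. Return the ObjectInspector that describes the return type
}

public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // 1. Check whether the argument is null
    // 2. Compute the length of the argument and return it
}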

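Partitioned and bucketed tables

Partitioned tables

Essence: the data is split into directories

Splitting further: second-level partitions

Relationship between partition columns and table columns: a partition column is not stored in the data files, but it can be queried like an ordinary column

Bucketed tables

Essence: the data is split into files

When querying a partitioned or bucketed table, filter on the partition or bucket column in the where clause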
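The "directories versus files" distinction can be sketched as follows. This is a simplified Python model, not Hive's implementation: the warehouse path and column names are hypothetical, and it assumes a non-negative integer bucket column whose hash is the value itself.

```python
def partition_dir(table_dir, **partition_values):
    """Directory of one partition: one `column=value` level per partition column."""
    levels = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_dir, *levels])


def bucket_of(key_hash, num_buckets):
    """Index of the bucket file a row lands in (assumption: non-negative hash)."""
    return key_hash % num_buckets


# Hypothetical table location with a two-level partition (day, then hour).
path = partition_dir("/user/hive/warehouse/dept_partition", day="20200401", hour="12")

# Hypothetical ids spread over 4 bucket files.
buckets = {row_id: bucket_of(row_id, 4) for row_id in (1001, 1002, 1003, 1004)}
print(path, buckets)
```

A `where day = '20200401'` filter lets the query read only the matching directory, which is the point of filtering on the partition column.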

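Queries

Basic queries

Selecting all columns or specific columns

select * from student;

select id, name from student;

Column aliases

select deptno, avg(sal) sal_avg from emp group by deptno;

Table aliases

Simplify queries; prefixing columns with the table alias may also improve query efficiency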

select t.id, t.name, b.age, b.id from table01 t join table02 b on t.id = b.id;

Arithmetic operators

select ename, sal, comm, nvl(comm, 0) + sal income from emp;

Logical operators

AND

select * from emp where sal > 1000 and deptno = 30;

OR

select * from emp where sal > 1000 or deptno = 30;

NOT

select * from emp where deptno not in (20, 30);

Common aggregate functions

count(*)

max()

min()

avg()

sum()

limit clause

select * from emp limit 5;

select * from emp limit 2, 3;
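The two forms of `limit` differ only in the offset. A small Python model, using list slicing on made-up rows, shows which rows each form keeps; it assumes the two-argument form means "skip the first value, then return the second value's worth of rows".

```python
def limit(rows, *args):
    """limit n -> first n rows; limit offset, n -> n rows after skipping offset rows."""
    if len(args) == 1:
        offset, n = 0, args[0]
    else:
        offset, n = args
    return rows[offset:offset + n]


rows = list(range(1, 11))       # stand-in for ten emp rows
first_five = limit(rows, 5)     # limit 5
page = limit(rows, 2, 3)        # limit 2, 3
print(first_five, page)
```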

where clause

Filters the rows returned by a query

select * from emp where sal > 1000;

Column aliases cannot be used in the where clause

Comparison operators

A<=>B

select ename, mgr, comm , mgr<=> comm from emp;
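`<=>` is the NULL-safe version of `=`. The Python sketch below contrasts the two, with `None` standing in for NULL; the function names are hypothetical.

```python
def eq(a, b):
    """A = B: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """A <=> B: true if both sides are NULL, false if only one is, otherwise same as =."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b


# (mgr, comm) pairs, invented for illustration.
pairs = [(7698, 300), (None, None), (7839, None), (500, 500)]
results = [(eq(a, b), null_safe_eq(a, b)) for a, b in pairs]
print(results)
```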

A [not] between B and C

select * from emp where sal between 1000 and 5000;

is null

is not null

in(value1, value2)

select * from emp where sal in(800, 5000);

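like

Simple pattern matching with wildcards: % matches any number of characters, _ matches exactly one

select ename from emp where ename like 'A%';

rlike

Full regular expressions (Java regex syntax)

select ename from emp where ename rlike '^A';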
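The practical difference between the two operators is that `like` must match the whole string, while `rlike` succeeds if the regular expression matches any part of it. The sketch below models both in Python; note that Python's `re` module is used here as a stand-in for Java regular expressions, so advanced syntax may differ.

```python
import re


def like(value, pattern):
    """SQL LIKE: % matches any run of characters, _ exactly one; the whole value must match."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def rlike(value, regex):
    """RLIKE: true if the regular expression matches anywhere in the value."""
    return re.search(regex, value) is not None


names = ["ALLEN", "BLAKE", "WARD"]
starts_with_a = [n for n in names if like(n, "A%")]     # ename like 'A%'
second_is_a = [n for n in names if like(n, "_A%")]      # ename like '_A%'
regex_start_a = [n for n in names if rlike(n, "^A")]    # ename rlike '^A'
contains_a = [n for n in names if rlike(n, "A")]        # ename rlike 'A'
print(starts_with_a, second_is_a, regex_start_a, contains_a)
```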

Grouping

group by

Compute the average salary of each department in the emp table

select deptno , avg(sal) from emp group by deptno;

having

List the departments in the emp table whose average salary is greater than 2000

select deptno , avg(sal) sal_avg from emp group by deptno having sal_avg >2000;
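`having` filters groups after aggregation, whereas `where` filters rows before it. A small Python model of the query above, on invented salary data, makes the two stages explicit.

```python
from collections import defaultdict

# Hypothetical emp rows of (deptno, sal).
emp = [(10, 1300.0), (10, 2450.0), (10, 5000.0), (20, 800.0), (20, 3000.0), (30, 950.0), (30, 1250.0)]

# Stage 1, group by deptno: collect the salaries of each department.
groups = defaultdict(list)
for deptno, sal in emp:
    groups[deptno].append(sal)

# Stage 2, avg(sal) sal_avg: one aggregated value per group.
sal_avg = {deptno: sum(sals) / len(sals) for deptno, sals in groups.items()}

# Stage 3, having sal_avg > 2000: filter on the aggregated value.
result = {deptno: avg for deptno, avg in sal_avg.items() if avg > 2000}
print(result)
```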

join

Inner join

join

inner join

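Left outer join

left join

Open question: how is this implemented?

Right outer join

right join

Open question: how is this implemented?

Full outer join

full join

Open question: how is this implemented?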
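Which rows each join type keeps can be modelled with two small lookup tables. This Python sketch assumes unique join keys on both sides and uses `None` for NULL; the table contents are made up.

```python
def join(left, right, how="inner"):
    """Join two {key: value} tables on their key; None stands for NULL."""
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "right":
        keys = list(right)
    elif how == "full":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in keys]


# Hypothetical tables keyed by deptno.
emp = {10: "CLARK", 20: "SMITH", 50: "TEMP"}
dept = {10: "ACCOUNTING", 20: "RESEARCH", 40: "OPERATIONS"}

inner = join(emp, dept, "inner")
left = join(emp, dept, "left")
right = join(emp, dept, "right")
full = join(emp, dept, "full")

# Rows that exist only on the left: a left join followed by a filter on the NULL right side.
left_only = [row for row in left if row[2] is None]
print(inner, left, right, full, left_only, sep="\n")
```

The last line shows one common variant: rows present only in the left table are a left join filtered to the rows whose right side is NULL.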

Multi-table joins

Cartesian product

Occurs when the join condition is missing, or is written but has no effect

select ename, deptno from emp, dept;

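Sorting

Global sort

order by

Uses a single reducer

Partitioned sort

The sorting part

sort by

Sorts within each reducer

Ordered within each partition, not globally

Used on its own, sort by distributes rows to reducers randomly

The partitioning part

distribute by

Partitions rows by the specified column(s)

Partition rule: hash(partition column) % number of reducers

cluster by

Can be used when the distribute by column and the sort by column are the same

cluster by only sorts in ascending order; asc or desc cannot be specified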
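The partition rule and the per-reducer sort can be combined into one small Python model of `distribute by deptno sort by sal desc`. It is a sketch under stated assumptions: the rows are invented, and an integer column is assumed to hash to its own value.

```python
def distribute_and_sort(rows, dist_key, sort_key, num_reduce, reverse=False):
    """Send each row to reducer hash % num_reduce, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reduce)]
    for row in rows:
        reducers[dist_key(row) % num_reduce].append(row)    # distribute by
    return [sorted(part, key=sort_key, reverse=reverse) for part in reducers]  # sort by


# Hypothetical (deptno, sal) rows.
rows = [(10, 1300), (20, 800), (30, 1250), (10, 5000), (20, 3000), (30, 950)]

parts = distribute_and_sort(
    rows,
    dist_key=lambda r: r[0],   # distribute by deptno
    sort_key=lambda r: r[1],   # sort by sal
    num_reduce=3,
    reverse=True,              # desc
)
for i, part in enumerate(parts):
    print(i, part)
```

Each reducer's output is ordered, but concatenating the three outputs does not give a globally sorted result, which is the difference from `order by`.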

DML (Data Manipulation Language)

Data import (4 ways)

load

local

loads from the local file system (the current node)

without local

loads from HDFS

insert

insert into tablename values(), ();

insert into tableName select * from table1 ;

hdfs dfs -put

create table tablename as select * from table_name;

Data export

insert overwrite [local] directory

local

exports data to the local file system

without local

exports data to HDFS

Data migration

export

import

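Both metadata and the actual data

Concepts

Hive is a data warehouse tool built on Hadoop that provides a SQL-like language for querying and computation

Essence: translates HQL into MapReduce programs

Basic architecture diagram (figure not preserved)

Hive compared with databases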

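Query language

Hive

Provides HQL, a SQL dialect

Database

Standard SQL

Data scale

Hive

Built on Hadoop: data is stored in HDFS and computed with MapReduce

Supports statistical analysis of large-scale data

Database

Relatively small data scale

MySQL → InnoDB

Data updates

Hive

Updating warehouse data is not recommended

The underlying storage, HDFS, does not support random modification

Large data scale

Suited to read-heavy, write-light workloads; mostly stores static data

Database

Mostly used for online applications

Supports efficient random updates

Latency

Hive

High query latency

The compute engine is MapReduce, and jobs run on a YARN cluster

High latency should be weighed against scale: Hive can handle large volumes of data

Database

Low latency

Limited data volume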

Data types

Primitive data types

int

bigint

float

double

string

boolean

Collection data types

array

friend array<string>

friend[0]

map

children map<string, string>

children["key"]

struct

address struct<street: string, city: string>

address.city

Type conversion

select "1" + 2;

3.0

select cast("1" as int) + 2;

DDL (Data Definition Language)

Database

Create

create database bigdata [location ] [with dbproperties ("key" = "value")];

Alter

alter database bigdata set dbproperties ("key" = "value");

Show

show databases;

show databases like "big*";

desc database [extended] bigdata;

Drop

drop database bigdata [cascade];

Table

Create

create [external] table table_name (col_name col_type) row format delimited fields terminated by "," collection items terminated by "_" map keys terminated by ":" lines terminated by "\n" [stored as textfile] [location ] [as select * from student] [like table_name]
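How the three delimiters in `row format delimited` carve up one line of a text file can be sketched in Python. The four-column schema (a string, an array, a map and a struct) and the sample line are assumptions made for illustration; only the delimiters come from the statement above.

```python
def parse_row(line, field_sep=",", item_sep="_", key_sep=":"):
    """Split one text line into (name, friends array, children map, address struct)."""
    name, friends, children, address = line.split(field_sep)     # fields terminated by ","
    street, city = address.split(item_sep)                        # struct members use "_"
    return {
        "name": name,
        "friends": friends.split(item_sep),                       # collection items use "_"
        "children": dict(kv.split(key_sep) for kv in children.split(item_sep)),  # map keys use ":"
        "address": {"street": street, "city": city},
    }


line = "songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing"
row = parse_row(line)
print(row["friends"][0], row["children"]["xiao song"], row["address"]["city"])
```

The three lookups in the last line correspond to the access syntax `friend[0]`, `children["key"]` and `address.city`.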

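Managed (internal) table:

Hive manages both the metadata and the actual data

Intermediate tables, result tables

External table:

Hive manages only the metadata

Shared data, original raw data

Alter

Change a column

alter table table_name change [column] col_old_name col_new_name col_type [first | after col_name];

Add or replace columns

alter table table_name add columns (col_name col_type, col_name_02 col_type_02);

alter table table_name replace columns (id string);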