
'LEFT JOIN' vs 'LEFT OUTER JOIN'



This article answers the question 'LEFT JOIN' vs 'LEFT OUTER JOIN' in detail, and then expands on related material: (Repost) MySQL Join usage (Inner join, Left join, Right join, Cross join, Union simulating Full join) and performance optimization; Apache beam Join.leftOuterJoin, how to pass a blank TableRow?; Hive LEFT SEMI JOIN / LEFT OUTER JOIN vs (IN / NOT IN), (EXISTS / NOT EXISTS); and Hive join operations (left join, right join, inner join). We hope it helps.


'LEFT JOIN' vs 'LEFT OUTER JOIN'

I know there is really no difference, but is "LEFT JOIN" the ANSI form, or is there any RDBMS where "LEFT JOIN" would fail and "LEFT OUTER JOIN" is required? [I'm asking here to save myself some clicks, form-filling, and so on to get at the actual ANSI standard!]

Answer 1


According to the ANSI spec, [OUTER] is optional (that's SQL-92, but I'm fairly sure later versions cover it too).

That assumes, of course, that every SQL product is ANSI compliant. For joins, they probably are...
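
As a quick sanity check, here is a minimal sketch (the toy tables t1 and t2 are invented for illustration, not from the question) showing the two spellings used interchangeably:

-- hypothetical tables, for illustration only
CREATE TABLE t1 (id INT);
CREATE TABLE t2 (id INT);

-- the two queries below are equivalent: OUTER is optional
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;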

(Repost) MySQL Join usage (Inner join, Left join, Right join, Cross join, Union simulating Full join) and performance optimization

http://blog.csdn.net/ochangwen/article/details/52346610

 

Preparing the test data

CREATE TABLE atable (
    aID int(1) AUTO_INCREMENT PRIMARY KEY,
    aNum char(20)
);

CREATE TABLE btable (
    bID int(1) NOT NULL AUTO_INCREMENT PRIMARY KEY,
    bName char(20)
);

INSERT INTO atable
VALUES (1, 'a20050111'), (2, 'a20050112'), (3, 'a20050113'), (4, 'a20050114'), (5, 'a20050115');

-- note: the leading space in ' 2006032401' is kept to match the query output below
INSERT INTO btable
VALUES (1, ' 2006032401'), (2, '2006032402'), (3, '2006032403'), (4, '2006032404'), (8, '2006032408');

-------------------------------------------------------------------------------------------

atable: the left table; btable: the right table.
JOIN falls into three broad categories by function:
  1) inner join (inner, or equi-join): returns the rows that have matching records in both tables.
  2) left join: returns all records of the left table (atable), even where the right table (btable) has no matching record.
  3) right join: the opposite of LEFT JOIN; returns all records of the right table (btable), even where the left table (atable) has no matching record.
Note: MySQL does not support FULL JOIN, but you can simulate it by merging a LEFT JOIN and a RIGHT JOIN with the UNION keyword.

1. Inner join

  An inner join, also called an equi-join, produces the set of rows that match in both A and B.
  The example below is used to explain the categories that follow, based on the two tables (A, B) above.

 
  mysql> select * from atable inner join btable on atable.aid=btable.bid;
  +-----+-----------+-----+-------------+
  | aID | aNum      | bID | bName       |
  +-----+-----------+-----+-------------+
  |   1 | a20050111 |   1 |  2006032401 |
  |   2 | a20050112 |   2 | 2006032402  |
  |   3 | a20050113 |   3 | 2006032403  |
  |   4 | a20050114 |   4 | 2006032404  |
  +-----+-----------+-----+-------------+

 

2. Left join

  left join (or left outer join: the two are equivalent in MySQL; left join is recommended) produces a complete set of records from the left table (A), together with the matching records from the right table (B). Where there is no match, the right side contains NULL.

 

 
  mysql> select * from atable left join btable on atable.aid=btable.bid;
  +-----+-----------+------+-------------+
  | aID | aNum      | bID  | bName       |
  +-----+-----------+------+-------------+
  |   1 | a20050111 |    1 |  2006032401 |
  |   2 | a20050112 |    2 | 2006032402  |
  |   3 | a20050113 |    3 | 2006032403  |
  |   4 | a20050114 |    4 | 2006032404  |
  |   5 | a20050115 | NULL | NULL        |
  +-----+-----------+------+-------------+


------------------------------------------------------------------------------------------------------------

 

  2) To produce only those records of the left table (A) that have no matching record in the right table (B), filter with a WHERE clause, as follows:

 

 
  mysql> select * from atable left join btable on atable.aid=btable.bid
      -> where atable.aid is null or btable.bid is null;
  +-----+-----------+------+-------+
  | aID | aNum      | bID  | bName |
  +-----+-----------+------+-------+
  |   5 | a20050115 | NULL | NULL  |
  +-----+-----------+------+-------+
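
Since a LEFT JOIN never NULL-extends the left table's own key, the atable.aid is null test above is redundant on this data; the anti-join is usually written with the right-side test alone, a minimal sketch:

  -- rows of atable with no match in btable
  select * from atable left join btable on atable.aid=btable.bid
  where btable.bid is null;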


-----------------------------------------------------------------------------------------

 

Similarly, you can emulate an inner join, as follows:


 

 
  mysql> select * from atable left join btable on atable.aid=btable.bid where atable.aid is not null and btable.bid is not null;
  +-----+-----------+------+-------------+
  | aID | aNum      | bID  | bName       |
  +-----+-----------+------+-------------+
  |   1 | a20050111 |    1 |  2006032401 |
  |   2 | a20050112 |    2 | 2006032402  |
  |   3 | a20050113 |    3 | 2006032403  |
  |   4 | a20050114 |    4 | 2006032404  |
  +-----+-----------+------+-------------+

------------------------------------------------------------------------------------------

 

3. Right join

  Works like Left join with the sides swapped: all records of the right table (btable) are kept, even where the left table has no match.

 

  mysql> select * from atable right join btable on atable.aid=btable.bid;
  +------+-----------+-----+-------------+
  | aID  | aNum      | bID | bName       |
  +------+-----------+-----+-------------+
  |    1 | a20050111 |   1 |  2006032401 |
  |    2 | a20050112 |   2 | 2006032402  |
  |    3 | a20050113 |   3 | 2006032403  |
  |    4 | a20050114 |   4 | 2006032404  |
  | NULL | NULL      |   8 | 2006032408  |
  +------+-----------+-----+-------------+
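
A RIGHT JOIN can always be rewritten as a LEFT JOIN with the operands swapped (only the column order of * changes), a sketch:

  -- equivalent to the right join above, with btable's columns listed first
  select * from btable left join atable on atable.aid=btable.bid;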

 

4. Difference (rows unique to each side)

 

 
  mysql> select * from atable left join btable on atable.aid=btable.bid
      -> where btable.bid is null
      -> union
      -> select * from atable right join btable on atable.aid=btable.bid
      -> where atable.aid is null;
  +------+-----------+------+------------+
  | aID  | aNum      | bID  | bName      |
  +------+-----------+------+------------+
  |    5 | a20050115 | NULL | NULL       |
  | NULL | NULL      |    8 | 2006032408 |
  +------+-----------+------+------------+


-----------------------------------------------------------------------------------

 

5. Cross join

  A cross join produces the product of the two tables, i.e. the Cartesian product.

    The Cartesian (Descartes) product is also called the direct product. Given a set A = {a, b} and a set B = {0, 1, 2}, their Cartesian product is {(a,0), (a,1), (a,2), (b,0), (b,1), (b,2)}; this extends to any number of sets. As a similar example, if A is the set of students at a school and B is the set of all courses there, then the Cartesian product of A and B represents every possible course-enrollment combination.

 

 
  mysql> select * from atable cross join btable;
  +-----+-----------+-----+-------------+
  | aID | aNum      | bID | bName       |
  +-----+-----------+-----+-------------+
  |   1 | a20050111 |   1 |  2006032401 |
  |   2 | a20050112 |   1 |  2006032401 |
  |   3 | a20050113 |   1 |  2006032401 |
  |   4 | a20050114 |   1 |  2006032401 |
  |   5 | a20050115 |   1 |  2006032401 |
  |   1 | a20050111 |   2 | 2006032402  |
  |   2 | a20050112 |   2 | 2006032402  |
  |   3 | a20050113 |   2 | 2006032402  |
  |   4 | a20050114 |   2 | 2006032402  |
  |   5 | a20050115 |   2 | 2006032402  |
  |   1 | a20050111 |   3 | 2006032403  |
  |   2 | a20050112 |   3 | 2006032403  |
  |   3 | a20050113 |   3 | 2006032403  |
  |   4 | a20050114 |   3 | 2006032403  |
  |   5 | a20050115 |   3 | 2006032403  |
  |   1 | a20050111 |   4 | 2006032404  |
  |   2 | a20050112 |   4 | 2006032404  |
  |   3 | a20050113 |   4 | 2006032404  |
  |   4 | a20050114 |   4 | 2006032404  |
  |   5 | a20050115 |   4 | 2006032404  |
  |   1 | a20050111 |   8 | 2006032408  |
  |   2 | a20050112 |   8 | 2006032408  |
  |   3 | a20050113 |   8 | 2006032408  |
  |   4 | a20050114 |   8 | 2006032408  |
  |   5 | a20050115 |   8 | 2006032408  |
  +-----+-----------+-----+-------------+
  25 rows in set (0.00 sec)

  # Now try: mysql> select * from A inner join B; (the result is the same as above)

  # Then try: mysql> select * from A cross join B on A.name = B.name;

 

    In fact, in MySQL (and MySQL only) CROSS JOIN behaves the same as INNER JOIN: without an ON condition both produce the Cartesian product, and with one both return the fully matched rows. The INNER and CROSS keywords can be omitted, so the following SQL statements are all equivalent:

 
  ... FROM table1 INNER JOIN table2
  ... FROM table1 CROSS JOIN table2
  ... FROM table1 JOIN table2

 

6. Simulating Full join with union

    A full join produces all records of tables A and B, pairing the rows that match; where there is no match, the opposite side contains NULL. It is similar to the difference query above.

 

 
  mysql> select * from atable left join btable on atable.aid=btable.bid
      -> union
      -> select * from atable right join btable on atable.aid=btable.bid;
  +------+-----------+------+-------------+
  | aID  | aNum      | bID  | bName       |
  +------+-----------+------+-------------+
  |    1 | a20050111 |    1 |  2006032401 |
  |    2 | a20050112 |    2 | 2006032402  |
  |    3 | a20050113 |    3 | 2006032403  |
  |    4 | a20050114 |    4 | 2006032404  |
  |    5 | a20050115 | NULL | NULL        |
  | NULL | NULL      |    8 | 2006032408  |
  +------+-----------+------+-------------+
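
One caveat (my note, not from the original post): UNION removes duplicate rows, which can silently merge legitimately duplicated data. A common variant uses UNION ALL and restricts the second query to the unmatched rows so the two sides cannot overlap, a sketch:

  select * from atable left join btable on atable.aid=btable.bid
  union all
  select * from atable right join btable on atable.aid=btable.bid
  where atable.aid is null;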


--------------------------------------------------------------------------------------------------------

 

7. Performance optimization

  1. Explicit INNER JOIN vs implicit INNER JOIN

 

 
  select * from
  table a inner join table b
  on a.id = b.id;

VS

 
  select a.*, b.*
  from table a, table b
  where a.id = b.id;

    Compared in a database of about 100,000 rows, the two take almost the same time; the first is the explicit inner join, the latter the implicit inner join.
2. left join/right join VS inner join
    Prefer inner join; avoid left join and NULL where you can.

 

    When using left join (or right join), you should clearly understand the following points:

(1) The execution order of ON and WHERE
    The ON condition (the ON in "A LEFT JOIN B ON <condition>") decides how data rows are retrieved from table B. If no row of B matches the ON condition, an extra row is generated with all columns set to NULL; during the matching phase the WHERE conditions are not used at all. Only after the matching phase finishes is the WHERE clause applied, filtering the rows the matching phase produced.
    So take note: when using a left (right) join, put as many of the matching conditions as possible into the join itself, reducing the work WHERE has to do. For example:

 
  select * from A
  inner join B on B.name = A.name
  left join C on C.name = B.name
  left join D on D.id = C.id
  where C.status>1 and D.status=1;

The following version is faster:

 
  select * from A
  inner join B on B.name = A.name
  left join C on C.name = B.name and C.status>1
  left join D on D.id = C.id and D.status=1

(2) Note the difference between the ON clause and the WHERE clause

 
  mysql> SELECT * FROM product LEFT JOIN product_details
         ON (product.id = product_details.id)
         AND product_details.id=2;
  +----+--------+------+--------+-------+
  | id | amount | id   | weight | exist |
  +----+--------+------+--------+-------+
  |  1 |    100 | NULL |   NULL |  NULL |
  |  2 |    200 |    2 |     22 |     0 |
  |  3 |    300 | NULL |   NULL |  NULL |
  |  4 |    400 | NULL |   NULL |  NULL |
  +----+--------+------+--------+-------+
  4 rows in set (0.00 sec)

  mysql> SELECT * FROM product LEFT JOIN product_details
         ON (product.id = product_details.id)
         WHERE product_details.id=2;
  +----+--------+----+--------+-------+
  | id | amount | id | weight | exist |
  +----+--------+----+--------+-------+
  |  2 |    200 |  2 |     22 |     0 |
  +----+--------+----+--------+-------+
  1 row in set (0.01 sec)

    As the above shows, the first query uses the ON condition to decide which rows to retrieve from the LEFT JOINed product_details table, while the second query does a plain LEFT JOIN and then uses the WHERE clause to filter out the rows that don't match.
(3) Prefer join over subqueries
    Performance tends to matter most when the data volume is large; that is when complex subqueries should be avoided. For example:

 

 

 
  insert into t1(a1) select b1 from t2
  where not exists(select 1 from t1 where t1.id = t2.r_id);

This version is better:

 
  insert into t1(a1)
  select b1 from t2
  left join (select distinct t1.id from t1 ) t1 on t1.id = t2.r_id
  where t1.id is null;

Apache beam Join.leftOuterJoin, how to pass a blank TableRow?

I want to perform a leftOuterJoin on 2 BigQuery tables in Apache Beam (Java SDK).

  1. I read the tables (leftTableCollection & rightTableCollection); both are of type PCollection<TableRow>.

  2. I convert them into the form PCollection<KV<String,TableRow>>, i.e.

  3. I am using Join.leftOuterJoin from

    org.apache.beam.sdk.extensions.joinlibrary.Join


PCollection<KV<String,TableRow>> leftTableKVCollection =
    leftTableCollection.apply(ParDo.of(new ExtractLeftTableDoFn()));
PCollection<KV<String,TableRow>> rightTableKVCollection =
    rightTableCollection.apply(ParDo.of(new ExtractRightTableDoFn()));

TableRow rightTableNullValues = ???

PCollection<KV<String,KV<TableRow,TableRow>>> joinKvResults =
    Join.leftOuterJoin(leftTableKVCollection,rightTableKVCollection,rightTableNullValues);


I don't know what to pass for rightTableNullValues.

Trying new TableRow() throws an unable to serialize error.

Any suggestions would be very helpful. TIA

Solution

No effective solution for this problem has been found yet.

Hive_LEFT SEMI JOIN / LEFT OUTER JOIN vs (IN / NOT IN), (EXISTS / NOT EXISTS) analysis

 

Reference: https://blog.csdn.net/happyrocking/article/details/79885071

 

In this article we take a closer look at Hive's LEFT SEMI JOIN and its (IN / NOT IN), (EXISTS / NOT EXISTS) subquery forms.

 

LEFT SEMI JOIN basics

First, we need to understand what LEFT SEMI JOIN is.

 

Characteristics

1. The restriction on LEFT SEMI JOIN is that the right-hand table may only be filtered in the ON clause; filtering it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.

2. LEFT SEMI JOIN passes only the table's join keys to the map stage, so the final SELECT of a LEFT SEMI JOIN may only reference columns of the left table.

3. Because LEFT SEMI JOIN has IN(keySet) semantics, the left table skips ahead when it meets duplicate rows in the right table, while JOIN keeps traversing. So when the right table holds duplicate keys, LEFT SEMI JOIN emits a single row where JOIN emits several; this is also why LEFT SEMI JOIN performs better.

For example, joining a table A and a table B with JOIN versus LEFT SEMI JOIN and then selecting all columns gives different results (the original post illustrated this with a screenshot; note that the right-table columns do not exist in the LEFT SEMI JOIN result at all, since its SELECT may only reference the left table). A sketch of the difference follows.
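
A minimal sketch of that difference, assuming two hypothetical tables A(id, av) and B(id) where B holds the key 1 twice (names and data invented for illustration):

-- hypothetical data: A = {(1,'x'), (2,'y')}, B = {(1), (1)}

SELECT a.* FROM A a JOIN B b ON a.id = b.id;
-- returns (1,'x') twice: join keeps traversing B's duplicate keys

SELECT a.* FROM A a LEFT SEMI JOIN B b ON (a.id = b.id);
-- returns (1,'x') once: the semi join skips ahead after the first match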


In effect, you can think of LEFT SEMI JOIN as the replacement for the subquery forms of (IN / NOT IN) and (EXISTS / NOT EXISTS).

Before Hive 0.13, subqueries inside (IN / NOT IN) and (EXISTS / NOT EXISTS) were not supported, so we had to use LEFT SEMI JOIN.

 

Building the basic test data

DROP TABLE IF EXISTS data_semi_a;
DROP TABLE IF EXISTS data_semi_b;

CREATE TABLE IF NOT EXISTS data_semi_a
(
    user_id BIGINT
    ,sex_id BIGINT
);

CREATE TABLE IF NOT EXISTS data_semi_b
(
    user_id BIGINT
    ,sex_id BIGINT
    ,age BIGINT
);

INSERT INTO TABLE data_semi_a VALUES
    (NULL, 0)
    ,(1, 1)
    ,(1, 0)
    ,(2, 1)
    ,(3, 0)
    ,(4, 1)
;

INSERT INTO TABLE data_semi_b VALUES
    (NULL, 0, 3)
    ,(1, 0, 12)
    ,(2, 1, 14)
;

    Test data:

    data_semi_a
    +----------------------+---------------------+
    | data_semi_a.user_id  | data_semi_a.sex_id  |
    +----------------------+---------------------+
    | NULL                 | 0                   |
    | 1                    | 1                   |
    | 1                    | 0                   |
    | 2                    | 1                   |
    | 3                    | 0                   |
    | 4                    | 1                   |
    +----------------------+---------------------+

    data_semi_b
    +----------------------+---------------------+------------------+
    | data_semi_b.user_id  | data_semi_b.sex_id  | data_semi_b.age  |
    +----------------------+---------------------+------------------+
    | NULL                 | 0                   | 3                |
    | 1                    | 0                   | 12               |
    | 2                    | 1                   | 14               |
    +----------------------+---------------------+------------------+


    A single-key LEFT SEMI JOIN is equivalent to IN

    Note

    LEFT SEMI JOIN is equivalent to IN; the principle is that only the LEFT SEMI JOIN keys are passed along.

    So for A LEFT SEMI JOIN B, the SELECT list must not contain columns of B.

    SELECT
        a.user_id
        ,a.sex_id
        ,b.age
    FROM data_semi_a AS a
    LEFT SEMI JOIN data_semi_b AS b
        ON a.user_id = b.user_id
    ;

    Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)
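
    If columns of b (such as b.age) are really needed, LEFT SEMI JOIN cannot express the query at all; a regular join is required instead. A minimal sketch (note that, per point 3 above, this also changes the behavior on duplicate right-table keys):

    SELECT
        a.user_id
        ,a.sex_id
        ,b.age
    FROM data_semi_a AS a
    JOIN data_semi_b AS b
        ON a.user_id = b.user_id
    ;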

     

     

    A single-key LEFT SEMI JOIN is equivalent to IN, as in the following SQL.

    The SQL statement:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    LEFT SEMI JOIN data_semi_b AS b
        ON a.user_id = b.user_id
    ;

    The equivalent IN SQL:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id IN (
        SELECT b.user_id
        FROM data_semi_b AS b
    );

     

    Let's compare the results of running the two statements.

    The LEFT SEMI JOIN result:

    INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    INFO : 2020-04-12 10:53:09,591 Stage-1 map = 0%, reduce = 0%
    INFO : 2020-04-12 10:53:17,849 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.12 sec
    INFO : 2020-04-12 10:53:22,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.84 sec
    INFO : 2020-04-12 10:53:29,141 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.77 sec
    INFO : MapReduce Total cumulative CPU time: 7 seconds 770 msec
    INFO : Ended Job = job_1586423165261_0087
    INFO : MapReduce Jobs Launched:
    INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.77 sec HDFS Read: 16677 HDFS Write: 135 SUCCESS
    INFO : Total MapReduce CPU Time Spent: 7 seconds 770 msec
    INFO : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
    INFO : OK
    +------------+-----------+
    | a.user_id  | a.sex_id  |
    +------------+-----------+
    | 1          | 0         |
    | 1          | 1         |
    | 2          | 1         |
    +------------+-----------+
    3 rows selected (29.073 seconds)

     

    The IN result:

    INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    INFO : 2020-04-12 10:37:26,143 Stage-1 map = 0%, reduce = 0%
    INFO : 2020-04-12 10:37:33,376 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.71 sec
    INFO : 2020-04-12 10:37:39,510 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.6 sec
    INFO : 2020-04-12 10:37:44,680 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.41 sec
    INFO : MapReduce Total cumulative CPU time: 7 seconds 410 msec
    INFO : Ended Job = job_1586423165261_0085
    INFO : MapReduce Jobs Launched:
    INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.41 sec HDFS Read: 16726 HDFS Write: 135 SUCCESS
    INFO : Total MapReduce CPU Time Spent: 7 seconds 410 msec
    INFO : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
    INFO : OK
    +------------+-----------+
    | a.user_id  | a.sex_id  |
    +------------+-----------+
    | 1          | 0         |
    | 1          | 1         |
    | 2          | 1         |
    +------------+-----------+
    3 rows selected (27.902 seconds)

     

    Now let's look at the EXPLAIN output of the two statements.

    The LEFT SEMI JOIN EXPLAIN output:

    INFO : Starting task [Stage-3:EXPLAIN] in serial mode
    INFO : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
    INFO : OK
    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: user_id is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: user_id is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: user_id (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: user_id (type: bigint) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: sex_id (type: bigint) |
    | TableScan |
    | alias: b |
    | filterExpr: user_id is not null (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: user_id is not null (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: user_id (type: bigint) |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: _col0 (type: bigint) |
    | mode: hash |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: _col0 (type: bigint) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 user_id (type: bigint) |
    | 1 _col0 (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +----------------------------------------------------+
    65 rows selected (0.136 seconds)

    The IN EXPLAIN output:

    INFO : Starting task [Stage-3:EXPLAIN] in serial mode
    INFO : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
    INFO : OK
    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: user_id is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: user_id is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: user_id (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: user_id (type: bigint) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: sex_id (type: bigint) |
    | TableScan |
    | alias: b |
    | filterExpr: user_id is not null (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: user_id is not null (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: user_id (type: bigint) |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: _col0 (type: bigint) |
    | mode: hash |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: _col0 (type: bigint) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 user_id (type: bigint) |
    | 1 _col0 (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +----------------------------------------------------+
    65 rows selected (0.127 seconds)

    You can see that the two are completely identical in both their results and their EXPLAIN output.

    In fact, IN is implemented internally with LEFT SEMI JOIN.


    Implementing NOT IN with LEFT OUTER JOIN

    Note that LEFT SEMI JOIN cannot implement NOT IN.

    The root cause: Hive does not support non-equi join conditions!

     

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    LEFT SEMI JOIN data_semi_b AS b
        ON (a.user_id != b.user_id)
    ;

    Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)

     

     

    The correct way to write it:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id NOT IN (
        SELECT b.user_id
        FROM data_semi_b AS b
    );
    INFO : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
    INFO : 2020-04-12 23:02:26,751 Stage-2 map = 0%, reduce = 0%
    INFO : 2020-04-12 23:02:33,938 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 1.76 sec
    INFO : 2020-04-12 23:02:39,172 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.35 sec
    INFO : 2020-04-12 23:02:47,688 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.88 sec
    INFO : MapReduce Total cumulative CPU time: 7 seconds 880 msec
    INFO : Ended Job = job_1586423165261_0106
    INFO : MapReduce Jobs Launched:
    INFO : Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 6.49 sec HDFS Read: 8372 HDFS Write: 96 SUCCESS
    INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.65 sec HDFS Read: 11974 HDFS Write: 96 SUCCESS
    INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 7.88 sec HDFS Read: 14131 HDFS Write: 87 SUCCESS
    INFO : Total MapReduce CPU Time Spent: 20 seconds 20 msec
    INFO : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
    INFO : OK
    +------------+-----------+
    | a.user_id  | a.sex_id  |
    +------------+-----------+
    +------------+-----------+
    No rows selected (91.674 seconds)
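
    This query returns no rows, and it is worth spelling out why (my note, not in the original): data_semi_b contains a NULL user_id, and NOT IN yields no rows at all whenever its subquery produces a NULL. If the intent is "rows of a whose user_id matches no non-NULL user_id in b", the NULL must be filtered explicitly; a sketch:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id NOT IN (
        SELECT b.user_id
        FROM data_semi_b AS b
        WHERE b.user_id IS NOT NULL
    );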

     

    The equivalent SQL. Note that NOT IN cannot be realized with LEFT SEMI JOIN; we need to use LEFT OUTER JOIN instead.

    The equivalent LEFT OUTER JOIN SQL:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    LEFT OUTER JOIN data_semi_b AS b
        ON a.user_id = b.user_id
        AND b.user_id IS NULL
    WHERE a.user_id IS NOT NULL
        AND b.user_id IS NOT NULL
    ;
    INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    INFO : 2020-04-12 23:04:47,896 Stage-1 map = 0%, reduce = 0%
    INFO : 2020-04-12 23:04:55,176 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.91 sec
    INFO : 2020-04-12 23:05:00,288 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.53 sec
    INFO : 2020-04-12 23:05:06,449 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.45 sec
    INFO : MapReduce Total cumulative CPU time: 8 seconds 450 msec
    INFO : Ended Job = job_1586423165261_0107
    INFO : MapReduce Jobs Launched:
    INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 8.45 sec HDFS Read: 16358 HDFS Write: 87 SUCCESS
    INFO : Total MapReduce CPU Time Spent: 8 seconds 450 msec
    INFO : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
    INFO : OK
    +------------+-----------+
    | a.user_id  | a.sex_id  |
    +------------+-----------+
    +------------+-----------+
    No rows selected (28.979 seconds)
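
    Note that the LEFT OUTER JOIN above is written to reproduce NOT IN's NULL behavior, so it too returns no rows on this data. The conventional LEFT OUTER JOIN anti-join, sketched below, puts only the equality in ON and tests for NULL in WHERE; on this data it would return (NULL, 0), (3, 0) and (4, 1), i.e. every row of a whose user_id matches nothing in b:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    LEFT OUTER JOIN data_semi_b AS b
        ON a.user_id = b.user_id
    WHERE b.user_id IS NULL
    ;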

     

    Let's look at the execution plans of these two statements.

    The NOT IN EXPLAIN output:

    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage |
    | Stage-1 depends on stages: Stage-4 |
    | Stage-2 depends on stages: Stage-1 |
    | Stage-0 depends on stages: Stage-2 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: b |
    | filterExpr: user_id is null (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: user_id is null (type: boolean) |
    | Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | aggregations: count() |
    | mode: hash |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | sort order: |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: _col0 (type: bigint) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: count(VALUE._col0) |
    | mode: mergepartial |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (_col0 = 0) (type: boolean) |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: 0 (type: bigint) |
    | mode: hash |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
    | |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | sort order: |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: user_id (type: bigint), sex_id (type: bigint) |
    | TableScan |
    | Reduce Output Operator |
    | sort order: |
    | Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 |
    | 1 |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
    | |
    | Stage: Stage-2 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | Reduce Output Operator |
    | key expressions: _col0 (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: _col0 (type: bigint) |
    | Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: _col1 (type: bigint) |
    | TableScan |
    | alias: b |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: user_id (type: bigint) |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: bigint) |
    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | sort order: + |
    | Map-reduce partition columns: _col0 (type: bigint) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Outer Join0 to 1 |
    | keys: |
    | 0 _col0 (type: bigint) |
    | 1 _col0 (type: bigint) |
    | outputColumnNames: _col0, _col1, _col5 |
    | Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: _col5 is null (type: boolean) |
    | Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col0 (type: bigint), _col1 (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +----------------------------------------------------+

     

    The LEFT OUTER JOIN EXPLAIN output:

    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: user_id is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: user_id is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: user_id (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: user_id (type: bigint) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: sex_id (type: bigint) |
    | TableScan |
    | alias: b |
    | filterExpr: (user_id is null and user_id is not null) (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (user_id is null and user_id is not null) (type: boolean) |
    | Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: user_id (type: bigint) |
    | sort order: + |
    | Map-reduce partition columns: user_id (type: bigint) |
    | Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Outer Join0 to 1 |
    | keys: |
    | 0 user_id (type: bigint) |
    | 1 user_id (type: bigint) |
    | outputColumnNames: _col0, _col1, _col5 |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: _col5 is not null (type: boolean) |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col0 (type: bigint), _col1 (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +----------------------------------------------------+
    63 rows selected (0.143 seconds)


    Using LEFT SEMI JOIN for multi-column IN, i.e. EXISTS

    Note: IN can only be applied to a single column; for multiple columns we need EXISTS.

     

    The following IN SQL is invalid:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    WHERE (a.user_id, a.sex_id) IN (
        SELECT
            a.user_id
            ,a.sex_id
        FROM data_semi_b AS b
    )
    ;

    Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)

     

     

    We need to use one of the following forms instead:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    LEFT SEMI JOIN data_semi_b AS b
        ON a.user_id = b.user_id
        AND a.sex_id = b.sex_id
    ;

    or:

    SELECT
        a.user_id
        ,a.sex_id
    FROM data_semi_a AS a
    WHERE EXISTS (
        SELECT 1
        FROM data_semi_b AS b
        WHERE
            a.user_id = b.user_id
            AND a.sex_id = b.sex_id
    )
    ;

    The result:

    INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    INFO : 2020-04-12 23:46:16,157 Stage-1 map = 0%, reduce = 0%
    INFO : 2020-04-12 23:46:24,375 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.04 sec
    INFO : 2020-04-12 23:46:28,545 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.84 sec
    INFO : 2020-04-12 23:46:35,732 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.85 sec
    INFO : MapReduce Total cumulative CPU time: 7 seconds 850 msec
    INFO : Ended Job = job_1586423165261_0110
    INFO : MapReduce Jobs Launched:
    INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.85 sec HDFS Read: 17951 HDFS Write: 119 SUCCESS
    INFO : Total MapReduce CPU Time Spent: 7 seconds 850 msec
    INFO : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
    INFO : OK
    +------------+-----------+
    | a.user_id  | a.sex_id  |
    +------------+-----------+
    | 1          | 0         |
    | 2          | 1         |
    +------------+-----------+
    2 rows selected (29.379 seconds)

     

     

    Let's look at the EXPLAIN output of the two forms.

    LEFT SEMI JOIN:

    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: user_id (type: bigint), sex_id (type: bigint) |
    | sort order: ++ |
    | Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | TableScan |
    | alias: b |
    | filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: user_id (type: bigint), sex_id (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: _col0 (type: bigint), _col1 (type: bigint) |
    | mode: hash |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: bigint), _col1 (type: bigint) |
    | sort order: ++ |
    | Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 user_id (type: bigint), sex_id (type: bigint) |
    | 1 _col0 (type: bigint), _col1 (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +----------------------------------------------------+
    64 rows selected (0.121 seconds)

     

    EXISTS 

    +----------------------------------------------------+
    | Explain |
    +----------------------------------------------------+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: user_id (type: bigint), sex_id (type: bigint) |
    | sort order: ++ |
    | Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
    | Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
    | TableScan |
    | alias: b |
    | filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (user_id is not null and sex_id is not null) (type: boolean) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: user_id (type: bigint), sex_id (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: _col0 (type: bigint), _col1 (type: bigint) |
    | mode: hash |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: bigint), _col1 (type: bigint) |
    | sort order: ++ |
    | Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
    | Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 user_id (type: bigint), sex_id (type: bigint) |
    | 1 _col0 (type: bigint), _col1 (type: bigint) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +----------------------------------------------------+
    64 rows selected (0.147 seconds)

     

    As you can see, the execution plans of the two forms are identical!


    Hive join operations: left join, right join, inner join

    The Hive execution engine translates HQL into MapReduce jobs.

    Except for map-side joins, the map tasks distribute the data and the actual join is performed on the reduce side.

    1. If multiple tables are joined on different columns, a single round of MapReduce cannot shuffle all the related data to one reducer. For a multi-table join, Hive caches the earlier tables in the reducer's memory and then streams the last table through the reducer to perform the join; to prevent OOM, the large table is usually put last, as sketched below.
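
    A minimal HiveQL sketch (table names invented for illustration): by default Hive streams the last table in the FROM clause, so the simplest fix is to list the big table last; the STREAMTABLE hint makes the choice explicit when the ordering can't be changed:

    -- big is listed last, so it is streamed while small1/small2 are buffered
    SELECT s1.k, s2.v, b.v
    FROM small1 s1
    JOIN small2 s2 ON s2.k = s1.k
    JOIN big b ON b.k = s1.k;

    -- same effect with an explicit hint, regardless of table order
    SELECT /*+ STREAMTABLE(b) */ s1.k, s2.v, b.v
    FROM big b
    JOIN small1 s1 ON s1.k = b.k
    JOIN small2 s2 ON s2.k = s1.k;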

    Hive distinguishes map joins and common joins.

    Common join

    Map phase:

    Reads the source table data.

    The map output key is the column in the join's ON condition; if the join has several join keys, their combination is used as the key.

    The value holds the columns needed after the join, plus a tag marking which table the value came from.

    Output is sorted by key.

    Shuffle phase:

    Rows are hashed by key and pushed to reducers accordingly, guaranteeing that identical keys from both tables land in the same reducer.

    Reduce phase:

    The join is completed per key, with the tag identifying which table each row came from.

    Map join

    Typically used when a small table joins a large one. The "small table" threshold is decided by hive.mapjoin.smalltable.filesize; with hive.auto.convert.join set to true, a table under that threshold is automatically converted to a map join (a sketch follows below).

    A map join can be regarded as a broadcast join: the small table is loaded into memory and there is no shuffle phase, which speeds up processing, but a large data volume can easily cause OOM.

    A common join involves a shuffle, which hurts efficiency and can also lead to data skew.
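
    A minimal sketch of the map join settings named above (the threshold value shown is only an example; the table names big and small are invented), plus the explicit MAPJOIN hint form:

    -- let Hive convert qualifying joins into map joins automatically
    SET hive.auto.convert.join = true;
    -- tables below ~25 MB count as 'small' (example value)
    SET hive.mapjoin.smalltable.filesize = 25000000;

    SELECT b.k, b.v, s.v
    FROM big b
    JOIN small s ON b.k = s.k;

    -- or request a map join explicitly with a hint
    SELECT /*+ MAPJOIN(s) */ b.k, b.v, s.v
    FROM big b
    JOIN small s ON b.k = s.k;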
