亲宝软件园·资讯

展开

Impala的count(distinct QUESTION_ID) 与ndv(QUESTION_ID)

张江英 人气:0

在impala中,一个select执行多个count(distinct col)会报错,举例:

select C_DEPT2,
         count(distinct QUESTION_BUSI_ID) as wo_num,
         count(distinct CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2

报错信息:

ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT QUESTION_BUSI_ID); deviating function: count(DISTINCT CREATOR_ID)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.

这时候,可通过以下方法解决:

1、得到的是近似值,数据量越大越不准确:

(1)SQL运行前,先运行命令:set APPX_COUNT_DISTINCT=true;

set APPX_COUNT_DISTINCT=true;
select C_DEPT2,
       count(distinct QUESTION_BUSI_ID) as wo_num,
       count(distinct CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

 

 

(2)将count(distinct col)用函数ndv(col)代替

select C_DEPT2,
       ndv(QUESTION_BUSI_ID) as wo_num,
       ndv(CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

需要注意的是,在set APPX_COUNT_DISTINCT=true;的情况下,使用count(distinct col)会自动转化成ndv(col),得到的是近似值,所以以上两种方法的结果数据一致。

2、精确值。拆分为子查询,再关联,如下:

set APPX_COUNT_DISTINCT = false; -- 将参数置为false,使用count(distinct col),确保不会转化成ndv(col)
select a.C_DEPT2, a.wo_num, b.creator_num from (select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2) a left join (select C_DEPT2, count(distinct CREATOR_ID) as creator_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2) b on a.C_DEPT2 = b.C_DEPT2 order by a.C_DEPT2

 

 

 

验证:

select C_DEPT2, count(*)
  from pdm.kudu_q_basic -- 表中无重复数据
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

 

 

总结:解决在impala中一个select执行多个count(distinct col)报错问题,可以用过设置参数set APPX_COUNT_DISTINCT = true;或将count(distinct col)用ndv(col)解决,但得到的是近似值,不准确。还可以通过分别在子查询中进行count(distinct col)再关联得到准确值,但要注意参数 APPX_COUNT_DISTINCT = false,不然会自动转化为ndv(col)得到的还是近似值。

 

加载全部内容

相关教程
猜你喜欢
用户评论