关于数据仓库数据质量的问题探讨-技术开发专区

关于数据仓库数据质量的问题探讨

作者：IT168 黄鹏编辑：晓熊 2007-03-07 13:32

四、处理脚本
实际上对于NULL值，缺乏一致性，数据范围的处理，可以通过简单的语句来完成。
Example:

-- Create a original table, only include ID and two measure fields. 
-- here ID is one of attribute columns only, so it can be Null values.
-- two measure fields have no constraint or check rule, they may be 
-- NULL or out of Scope, on the assumption that the scope value of 
-- measure1 and measure2 field is between 0 and 100.
Create table table_original
(
  ID         INT,
  Measure1   LONG,
  Measure2   LONG
)
-- Create a dimension table, only include ID and Name fields.
CREATE TABLE table_dimension
(
  ID        INT CONSTRAINT Pri_Table_Dimension PRIMARY KEY,
  Name      VARCHAR2 (20)
)
-- Create a fact table, Only include ID and two Measure. It will save 
-- compute or count result.
Create table table_fact
(
  ID         INT   NOT NULL,
  Measure1   LONG,
  Measure2   LONG
)
--Insert a sample data into dimension table and original table.
INSERT INTO table_dimension VALUES (1,'a');
INSERT INTO table_dimension VALUES (2,'b');
COMMIT;
INSERT INTO table_original VALUES (1,1,1);
INSERT INTO table_original VALUES (2,101,2);
INSERT INTO table_original VALUES (3,3,3);
INSERT INTO table_original VALUES (null,4,4);
COMMIT;

--Common OLTP Report Usage, Query SQL is like 
SELECT A.ID ID,SUM(A.Measure1) Measure1,SUM(A.Measure2) Measure2
  FROM table_original A,table_dimension B
 WHERE A.ID = B.ID
 GROUP BY A.ID;
--The Result is below:
  ID       Measure1 Measure2
  1        1        1
  2        101      2
So we can get the count value of 1 and 2 in ID field only, it means two
Records are lost, in fact the Measure1 value of ID 2 is out of scope.
--If in OLAP report, we will make Data Clean
--If ID is NULL, we will use -1 replace it
--If ID is not exist in original table, we will use -2 replace it
--We also need check the validity and scope of measures field, we will 
--give a default value or average value, it depend on our need.
--So query SQL is like
SELECT  -- A.ID,B.ID,    -- the transformed field names ID
  DECODE(NVL(A.ID,-2),-2,-2, DECODE(NVL(B.ID,-1),-1,-1,B.ID)) AS ID,
  SUM(CASE A.Measure1>0 AND A.Measure1<100 THEN Measure1 ELSE 0 END
     ) Measure1,
  SUM(CASE A.Measure2>0 AND A.Measure2<100 THEN Measure2 ELSE 0 END
     ) Measure2,
FROM table_original A,table_dimension B
WHERE A.ID = B.ID(+)
GROUP BY DECODE(NVL(B.ID,-2),-2,-2, DECODE(NVL(B.ID,-1),-1,-1,B.ID))

So we will get the result below

加入校验过于复杂，我们也可以求助用户自定义函数或者存储过程进行数据清洗。

第1页：数据质量和清洗第2页：原因分析和解决思路第3页：处理对策第4页：处理脚本第5页：可度量性

关注我们