技术开发 频道

关于数据仓库数据质量的问题探讨


四、 处理脚本
实际上对于NULL值,缺乏一致性,数据范围的处理,可以通过简单的语句来完成。
Example:
-- Create a original table, only include ID and two measure fields. -- here ID is one of attribute columns only, so it can be Null values. -- two measure fields have no constraint or check rule, they may be -- NULL or out of Scope, on the assumption that the scope value of -- measure1 and measure2 field is between 0 and 100. Create table table_original ( ID INT, Measure1 LONG, Measure2 LONG ) -- Create a dimension table, only include ID and Name fields. CREATE TABLE table_dimension ( ID INT CONSTRAINT Pri_Table_Dimension PRIMARY KEY, Name VARCHAR2 (20) ) -- Create a fact table, Only include ID and two Measure. It will save -- compute or count result. Create table table_fact ( ID INT NOT NULL, Measure1 LONG, Measure2 LONG ) --Insert a sample data into dimension table and original table. INSERT INTO table_dimension VALUES (1,'a'); INSERT INTO table_dimension VALUES (2,'b'); COMMIT; INSERT INTO table_original VALUES (1,1,1); INSERT INTO table_original VALUES (2,101,2); INSERT INTO table_original VALUES (3,3,3); INSERT INTO table_original VALUES (null,4,4); COMMIT;
--Common OLTP Report Usage, Query SQL is like SELECT A.ID ID,SUM(A.Measure1) Measure1,SUM(A.Measure2) Measure2 FROM table_original A,table_dimension B WHERE A.ID = B.ID GROUP BY A.ID; --The Result is below: ID Measure1 Measure2 1 1 1 2 101 2 So we can get the count value of 1 and 2 in ID field only, it means two Records are lost, in fact the Measure1 value of ID 2 is out of scope. --If in OLAP report, we will make Data Clean --If ID is NULL, we will use -1 replace it --If ID is not exist in original table, we will use -2 replace it --We also need check the validity and scope of measures field, we will --give a default value or average value, it depend on our need. --So query SQL is like SELECT -- A.ID,B.ID, -- the transformed field names ID DECODE(NVL(A.ID,-2),-2,-2, DECODE(NVL(B.ID,-1),-1,-1,B.ID)) AS ID, SUM(CASE A.Measure1>0 AND A.Measure1<100 THEN Measure1 ELSE 0 END ) Measure1, SUM(CASE A.Measure2>0 AND A.Measure2<100 THEN Measure2 ELSE 0 END ) Measure2, FROM table_original A,table_dimension B WHERE A.ID = B.ID(+) GROUP BY DECODE(NVL(B.ID,-2),-2,-2, DECODE(NVL(B.ID,-1),-1,-1,B.ID))
So we will get the result below 
 
  加入校验过于复杂,我们也可以求助用户自定义函数或者存储过程进行数据清洗。
0
相关文章