四、 处理脚本
实际上对于NULL值,缺乏一致性,数据范围的处理,可以通过简单的语句来完成。
Example:
-- Create a original table, only include ID and two measure fields.
-- here ID is one of attribute columns only, so it can be Null values.
-- two measure fields have no constraint or check rule, they may be
-- NULL or out of Scope, on the assumption that the scope value of
-- measure1 and measure2 field is between 0 and 100.
Create table table_original
(
ID INT,
Measure1 LONG,
Measure2 LONG
)
-- Create a dimension table, only include ID and Name fields.
CREATE TABLE table_dimension
(
ID INT CONSTRAINT Pri_Table_Dimension PRIMARY KEY,
Name VARCHAR2 (20)
)
-- Create a fact table, Only include ID and two Measure. It will save
-- compute or count result.
Create table table_fact
(
ID INT NOT NULL,
Measure1 LONG,
Measure2 LONG
)
--Insert a sample data into dimension table and original table.
INSERT INTO table_dimension VALUES (1,'a');
INSERT INTO table_dimension VALUES (2,'b');
COMMIT;
INSERT INTO table_original VALUES (1,1,1);
INSERT INTO table_original VALUES (2,101,2);
INSERT INTO table_original VALUES (3,3,3);
INSERT INTO table_original VALUES (null,4,4);
COMMIT;
So we will get the result below--Common OLTP Report Usage, Query SQL is like
SELECT A.ID ID,SUM(A.Measure1) Measure1,SUM(A.Measure2) Measure2
FROM table_original A,table_dimension B
WHERE A.ID = B.ID
GROUP BY A.ID;
--The Result is below:
ID Measure1 Measure2
1 1 1
2 101 2
So we can get the count value of 1 and 2 in ID field only, it means two
Records are lost, in fact the Measure1 value of ID 2 is out of scope.
--If in OLAP report, we will make Data Clean
--If ID is NULL, we will use -1 replace it
--If ID is not exist in original table, we will use -2 replace it
--We also need check the validity and scope of measures field, we will
--give a default value or average value, it depend on our need.
--So query SQL is like
SELECT -- A.ID,B.ID, -- the transformed field names ID
DECODE(NVL(A.ID,-2),-2,-2, DECODE(NVL(B.ID,-1),-1,-1,B.ID)) AS ID,
SUM(CASE A.Measure1>0 AND A.Measure1<100 THEN Measure1 ELSE 0 END
) Measure1,
SUM(CASE A.Measure2>0 AND A.Measure2<100 THEN Measure2 ELSE 0 END
) Measure2,
FROM table_original A,table_dimension B
WHERE A.ID = B.ID(+)
GROUP BY DECODE(NVL(B.ID,-2),-2,-2, DECODE(NVL(B.ID,-1),-1,-1,B.ID))

加入校验过于复杂,我们也可以求助用户自定义函数或者存储过程进行数据清洗。