4.数据压缩测试
表和索引压缩是SQL Server 2008起才具有的功能,有行压缩和页压缩2种选项。只有在Enterprise 版和 Developer 版中才能使用压缩。
对表进行压缩有2种方法,一种是在创建表的时候指定压缩属性,再加载数据,加载数据又分为从外部文本导入和从数据库另一个表复制二种;另一种是对已经包含数据的非压缩表进行修改压缩属性的操作,然后重新整理表。另外SQL Server也提供了一个存储过程,可以预测一个表压缩后的大小,供人作决定前参考。
预测压缩率的存储过程sp_estimate_data_compression_savings的语法为:
[ @schema_name = ] 'schema_name'
, [ @object_name = ] 'object_name'
, [@index_id = ] index_id
, [@partition_number = ] partition_number
, [@data_compression = ] 'data_compression'
其中5个参数分别是模式名、对象名,索引的 ID、分区号,要评估的压缩的类型(NONE、ROW和PAGE),不存在的参数用NULL表示。
1> use tpch;
2> go
1> set statistics time on
2> go
1> select * into tpch.dbo.l from lineitem --从压缩表复制到非压缩表
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 2 ms.
SQL Server Execution Times:
CPU time = 236109 ms, elapsed time = 239178 ms.
(59986052 rows affected)
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 59986052 8280224 KB 8280184 KB 32 KB 8 KB
1> set statistics time off
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
1> drop table l;
2> go
1> set statistics time on
2> go
1>
2> select * into tpch.dbo.l from master.dbo.lineitem --从非压缩表复制到非压缩表
3> go
SQL Server parse and compile time:
CPU time = 8 ms, elapsed time = 8 ms.
SQL Server Execution Times:
CPU time = 107609 ms, elapsed time = 109807 ms.
(59986052 rows affected)
1> set statistics time off
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 59986052 8280216 KB 8280184 KB 24 KB 8 KB
1> drop table L;
2> go
1> select * into tpch.dbo.l from master.dbo.lineitem where 1=2; --复制表结构
2> go
(0 rows affected)
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 0 80 KB 8 KB 16 KB 56 KB
1> alter table tpch.dbo.l rebuild WITH (DATA_COMPRESSION =page); --修改为页压缩属性
2> go
1> set statistics time on
2> go
1> insert into tpch.dbo.l with (tablock) select * from master.dbo.lineitem; --用insert with (tablock) 大批量插入压缩表
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 479813 ms, elapsed time = 480804 ms.
(59986052 rows affected)
1> set statistics time off
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 59986052 4404056 KB 4403976 KB 24 KB 56 KB
1> truncate table L;
2> go
1> set statistics time on
2> go
1> BULK INSERT L FROM 'd:\soft\lineitem.tbl' WITH ( FIELDTERMINATOR = '|',ROWTERMINATOR = '|\n'); --用BULK insert大批量插入压缩表
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 17 ms.
SQL Server Execution Times:
CPU time = 1141437 ms, elapsed time = 1223991 ms.
(59986052 rows affected)
1> set statistics time off
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
1> truncate table L;
2> go
1> alter table tpch.dbo.l rebuild WITH (DATA_COMPRESSION =none);
2> go
1> set statistics time on
2> go
1> BULK INSERT L FROM 'd:\soft\lineitem.tbl' WITH ( FIELDTERMINATOR = '|',ROWTERMINATOR = '|\n'); --用BULK insert大批量插入非压缩表
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 15 ms.
SQL Server Execution Times:
CPU time = 1032969 ms, elapsed time = 1130748 ms.
(59986052 rows affected)
1> set statistics time off
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 59986052 8282152 KB 8280184 KB 40 KB 1928 KB
1> alter table tpch.dbo.l rebuild WITH (DATA_COMPRESSION =page); --用rebuild从非压缩表转为压缩
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 475936 ms, elapsed time = 60577 ms.
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 59986052 4404696 KB 4404392 KB 24 KB 280 KB
1> alter table tpch.dbo.l rebuild WITH (DATA_COMPRESSION =none);--用rebuild从压缩表转为非压缩
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 367218 ms, elapsed time = 52398 ms.
--用sp_estimate_data_compression_savings估计按行压缩后的表大小
1> sp_estimate_data_compression_savings 'DBO' ,'L',NULL,NULL,'ROW'
2> go
object_name
schema_name
partition_number size_with_current_compression_setting(KB) size_with_requested_compression_setting(KB)
sample_size_with_current_compression_setting(KB) sample_size_with_requested_compression_setting(KB)
--------------------------------------------------
L
DBO
1 4406080 5450120
41088 50824
2> set statistics time on
3> go
1> alter table tpch.dbo.l rebuild WITH (DATA_COMPRESSION =row)
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 3 ms.
SQL Server Execution Times:
CPU time = 341969 ms, elapsed time = 45206 ms.
1> set statistics time off
2> go
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
1> sp_spaceused 'L'
2> go
name rows reserved data index_size unused
----- ---------- ------------------ ------------------ ------------------ ----------
l 59986052 5567656 KB 5567368 KB 40 KB 248 KB
从测试结果可知,同样的LINEITEM表数据,在数据库中复制最快,而其中从非压缩表到非压缩表又比其他方式快,只要109秒,读压缩表写入非压缩表多了解压的操作,慢了100多秒,读非压缩表插入压缩表多了压缩的操作,又多了200多秒。如果用bulk insert用时大约1100多秒,插入非压缩表也比插入压缩表快100多秒,差距没有前面数据库内操作明显。对已有数据的表进行压缩或解压缩rebuild操作的速度都很快,从CPU时间和实际时间的比较可知,这个操作是并行处理的,因此比其他操作优势明显。用页方式压缩比行方式的压缩率高,但压缩的速度是行方式高,sp_estimate_data_compression_savings的抽样估计还是比较准确的。
压缩对不同的表效果不同,下面是压缩前后TPC-H数据的对比。
表1 各种压缩类型的大小比较,单位:KB
表2 各种压缩类型的大小比较,单位:以原始未压缩大小为1的比率
从上面表的占用空间可见,对于TPC-H数据,因为dbgen生成的数据比较随机,又是符合第3范式的,冗余较少,压缩的效果不太明显。最大压缩约为原来的1/2,压缩率较低,有些表如PARTSUPP,基本上没有被压缩。而且,如果向具有压缩属性的表导入数据,无论采用什么方式,导入时间都大大长于导入非压缩表,因此,像TPCH这样大量数据的导入,还是首先导入非压缩表,再用alter table rebuild命令压缩比较合适。
聚集主键增加的空间非常少,这与它的存储和数据在一起有关,非聚集主键的存储和聚集主键有较大的区别,它是和数据表分离的,因此占用空间要超过聚集主键。