Batch HTML-to-Text Conversion: An Implementation

Author: ITPUB Forum  Editor: nancy  2008-03-20 19:46

[IT168 Technical Document]

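The original code followed "Recipe 12.11. Using MSHTML to Parse XML or HTML" and used the htmlfile COM object to extract the text.
It converts every HTML file in the current directory into a text file.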

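def extractHtmlFile(htmlFilePath):
    '''Extract html text and save to text file.
    '''
    import win32com.client

    # Read the raw HTML source.
    htmlData = open(htmlFilePath, 'r').read()
    # Let MSHTML parse the markup through the 'htmlfile' COM object.
    html = win32com.client.Dispatch('htmlfile')
    html.writeln(htmlData)
    # innerText holds the visible text; encode it as GBK and drop
    # any characters GBK cannot represent.
    text = html.body.innerText.encode('gbk', 'ignore')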
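
The listing above covers the extraction of a single file. The batch step, walking the current directory and writing one text file per HTML file, could look roughly like the sketch below. It is written for Python 3, and the names convert_directory and extract_text are hypothetical, not part of the original code. The extractor is passed in as a parameter and is assumed to return encoded bytes, so the loop itself does not depend on COM.

```python
import glob
import os


def convert_directory(extract_text, directory=".", pattern="*.html"):
    """Run extract_text on every matching file and save the result as .txt.

    extract_text is any callable that takes a file path and returns bytes
    (for example GBK-encoded text). Returns the list of files written.
    """
    written = []
    for html_path in sorted(glob.glob(os.path.join(directory, pattern))):
        text = extract_text(html_path)
        # a.html -> a.txt, next to the source file
        txt_path = os.path.splitext(html_path)[0] + ".txt"
        with open(txt_path, "wb") as out:
            out.write(text)
        written.append(txt_path)
    return written
```

Keeping the extractor as an argument makes it easy to swap the htmlfile-based function for another one without touching the loop.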

In practice, however, MSHTML can fail while parsing some files, and the text extraction then fails with it.
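
One way to keep a batch run going when the COM parser raises an error is to fall back to a different extractor for that file. The sketch below is not from the original article: it swaps in the Python 3 standard library's html.parser as the fallback, and the names fallback_text and extract_with_fallback are hypothetical. The fallback only collects text nodes, so its output will not match the layout of MSHTML's innerText.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect character data, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def fallback_text(html_source):
    """Extract text with the stdlib parser, one text node per line."""
    parser = _TextCollector()
    parser.feed(html_source)
    parser.close()
    return "\n".join(parser.chunks)


def extract_with_fallback(html_source, primary):
    """Try primary(html_source); on any exception use the stdlib parser."""
    try:
        return primary(html_source)
    except Exception:
        return fallback_text(html_source)
```

Here primary stands for whatever COM-based function is in use. Catching Exception broadly is a deliberate choice for batch work, where one bad file should not stop the run.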

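jigloo tested more than 100,000 HTML files and concluded that htmlfile is far less fault-tolerant than InternetExplorer.Application.
His code is roughly as follows; driving IE takes a little more work: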

#!/usr/bin/env python
import sys, os, re, codecs
import time
import win32com.client

class htmlfile:
    def __init__(self):
        # Start Internet Explorer through COM; Silent suppresses its dialog boxes.
        self.__ie = win32com.client.Dispatch('InternetExplorer.Application')
        self.__ie.Silent = True
        self.__filename = ''
        self.__document = None

    def __del__(self):
        # Shut the IE process down when the wrapper is destroyed.
        self.__ie.Quit()

    def __getdocument(self, filename):
        filename = os.path.abspath(filename)
        # Only navigate when a different file is requested.
        if self.__filename != filename:
            self.__filename = filename
            self.__ie.Navigate2(filename)
            self.__ie.Document.close()
            # Poll until IE has built the document body.
            while self.__ie.Document.Body is None:
                time.sleep(0.1)
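
The polling loop at the end of the listing has no upper bound, so a page that never finishes loading would stall the whole batch. A bounded variant might look like the sketch below, written for Python 3. The helper name wait_until and its default timeout are assumptions, not part of jigloo's code.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.1):
    """Poll condition() until it returns a true value or timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the class above, the condition would be a small function that checks whether the IE document body is no longer None. A False result lets the caller skip or log the file instead of waiting forever.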
