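When parsing an XML file with Python, parsing works fine if the XML declaration specifies UTF-8 encoding. If it declares an encoding that the expat parser does not recognize, such as GBK, an exception like the following is raised: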
xml.parsers.expat.ExpatError: unknown encoding
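Therefore, to keep the program running correctly, we need to convert the encoding of the data we read before parsing it.
1. First, decode the data that was read from its original encoding and re-encode it as UTF-8;
2. Change the encoding named in the XML declaration to match.
The code is as follows: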
# Python 2 code (print statement, urllib2); unused imports removed.
import re
import urllib2
from urllib import unquote
import xml.etree.ElementTree as Etree


def readDataFromNetwork(url):
    # Fetch the raw response body from the given URL.
    req = urllib2.Request(url)
    rd = urllib2.urlopen(req)
    readData = rd.read()
    return readData


def parseXmlStr(_str):
    try:
        # Decode the string from GBK and re-encode it as UTF-8.
        _str = unquote(_str)
        _str = _str.decode('gbk').encode('utf-8')
        print _str[0:100]
    except Exception, ex:
        print 'error:', ex
    # Change the encoding named in the XML declaration. Only the first
    # 'gbk' is replaced, so the same text later in the body is left alone.
    _str = re.sub('gbk', 'utf-8', _str, count=1)
    xmlDoc = Etree.fromstring(_str)
    childList = xmlDoc.getchildren()
    for node in childList:
        str_value = node.find("display/url").text
        if str_value.find('CDATA') != -1:
            print 'haha'
The output is as follows:
百日咳 <?xml version="1.0" encoding="utf-8" ?> 百日咳
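For readers on Python 3, the same two steps can be sketched roughly as follows. This is only an illustration, assuming the source encoding is known in advance (GBK here); the helper name to_utf8_xml and the sample document are hypothetical and not part of the original code. The pattern touches only the encoding named in the leading XML declaration, so a literal "gbk" elsewhere in the document is left alone.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of a leading XML declaration only.
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?\bencoding\s*=\s*)(["\'])([^"\']+)\2'
)


def to_utf8_xml(raw, source_encoding="gbk"):
    """Return the document as UTF-8 bytes with a matching declaration.

    Hypothetical helper: `source_encoding` is an assumption supplied by the
    caller; it is not detected automatically.
    """
    # Step 1: decode from the original encoding.
    text = raw.decode(source_encoding)
    # Step 2: rewrite the declared encoding (at most once, at the very start).
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    # Re-encode so the bytes agree with the new declaration.
    return text.encode("utf-8")


if __name__ == "__main__":
    # Made-up sample data, mirroring the display/url layout used above.
    sample = (
        '<?xml version="1.0" encoding="gbk" ?>'
        '<root><item><display><url>百日咳</url></display></item></root>'
    ).encode("gbk")

    root = ET.fromstring(to_utf8_xml(sample))
    for node in root:
        print(node.find("display/url").text)
```

Working on the decoded text and encoding it again at the very end keeps the bytes and the declaration consistent, which is what the parser checks.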