The code for this page can be downloaded here.

1. Using XPath

Its full name is XML Path Language, i.e. a path language for XML.
Common rules:
nodename    selects all child nodes of the named node
/           selects a direct child of the current node
//          selects descendants of the current node
.           selects the current node
..          selects the parent of the current node
@           selects an attribute
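A minimal sketch of the rules above in action (assuming lxml is installed; the one-line document here is made up for illustration):

```python
from lxml import etree

# A tiny document to exercise each rule.
doc = etree.HTML('<div><ul><li class="item"><a href="link1.html">first</a></li></ul></div>')

print(doc.xpath('//li'))          # // selects matching descendants anywhere
print(doc.xpath('//ul/li'))       # / steps to direct children of ul
print(doc.xpath('//li/@class'))   # @ selects an attribute value
print(doc.xpath('//a/..'))        # .. steps up to the parent node (the li)
```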
The XML file:
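The original test.xml listing is missing here; judging from the outputs below (five li items whose a children link to link1.html through link5.html, and a parent class of item-0 for link4), it likely resembles the following reconstruction rather than the exact original:

```xml
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-0"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
```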

Code:

from lxml import etree

html = etree.parse('test.xml', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

Result:

We can see that even though some closing tags are missing, they are completed automatically.

All nodes:
# xml test -> all
html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//*')
print(result)

Result:

[
<Element html at 0x1d19a6d9a48>,
<Element body at 0x1d19a6d9b48>,
<Element div at 0x1d19a6d9b88>,
<Element ul at 0x1d19a6d9bc8>,
<Element li at 0x1d19a6d9c08>,
<Element a at 0x1d19a6d9c88>,
<Element li at 0x1d19a6d9cc8>,
<Element a at 0x1d19a6d9d08>,
<Element li at 0x1d19a6d9d48>,
<Element a at 0x1d19a6d9c48>,
<Element li at 0x1d19a6d9d88>,
<Element a at 0x1d19a6d9dc8>,
<Element li at 0x1d19a6d9e08>,
<Element a at 0x1d19a6d9e48>
]
Child nodes:
# xml test -> son
html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Result:

[
<Element a at 0x26e0af09b48>,
<Element a at 0x26e0af09b88>,
<Element a at 0x26e0af09bc8>,
<Element a at 0x26e0af09c08>,
<Element a at 0x26e0af09c48>
]

This outputs all a child nodes under the li nodes.
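To extract the link text rather than the element objects, text() can be appended to the path — a small sketch using an inline two-item snippet (made up here) so it runs without test.xml:

```python
from lxml import etree

# Inline sample standing in for test.xml, so the snippet is self-contained.
text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)
# text() returns the text content of the matched a nodes.
result = html.xpath('//li/a/text()')
print(result)  # ['first item', 'second item']
```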

Parent node:
# xml test -> father
html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Output:
['item-0']
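Equivalently, the step up can be written with the parent:: axis instead of ..; a sketch on an inline one-item snippet (made up for illustration):

```python
from lxml import etree

text = '<ul><li class="item-0"><a href="link4.html">fourth item</a></li></ul>'
html = etree.HTML(text)
# parent::* is equivalent to .. for stepping up to the parent node.
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)  # ['item-0']
```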

Attribute retrieval:
# xml test -> get attribute
html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

Output:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

Matching multi-valued attributes:
# xml test -> get attribute which have more than one values
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

Output:
['first item']
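contains() is needed here because an exact comparison against a multi-valued class attribute fails; compare:

```python
from lxml import etree

text = '<li class="li li-first"><a href="link.html">first item</a></li>'
html = etree.HTML(text)
# An exact comparison only matches the full attribute string "li li-first".
exact = html.xpath('//li[@class="li"]/a/text()')
print(exact)  # []
# contains() matches when "li" appears anywhere in the class value.
fuzzy = html.xpath('//li[contains(@class, "li")]/a/text()')
print(fuzzy)  # ['first item']
```

Note that contains() performs plain substring matching, so it would also match a class like "li-other"; that looseness is usually acceptable for examples like this one.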

Matching by multiple attributes:
# xml test -> match message by more than one attribute
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Output:
['first item']

Selecting in order:
# xml test -> output by order
# parse test.xml (its five li items make the ordering predicates meaningful)
html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//li[1]/a/text()')         # print the first node
print(result)
result = html.xpath('//li[last()]/a/text()')    # print the last node
print(result)
result = html.xpath('//li[position()<3]/a/text()')    # print the nodes whose position is smaller than 3
print(result)
result = html.xpath('//li[last()-2]/a/text()')  # print the antepenultimate node
print(result)
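With an inline list of four items (hypothetical data, so the snippet is self-contained), the ordering predicates behave as follows:

```python
from lxml import etree

text = '''
<ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
    <li><a href="link3.html">third item</a></li>
    <li><a href="link4.html">fourth item</a></li>
</ul>
'''
html = etree.HTML(text)
print(html.xpath('//li[1]/a/text()'))             # ['first item']
print(html.xpath('//li[last()]/a/text()'))        # ['fourth item']
print(html.xpath('//li[position()<3]/a/text()'))  # ['first item', 'second item']
print(html.xpath('//li[last()-2]/a/text()'))      # ['second item']  (4 - 2 = position 2)
```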
Node-axis selection:
# xml test -> node axes
# parse test.xml so the ancestor/descendant/sibling queries have something to match
html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)

First selection: the ancestor axis retrieves all ancestor nodes.
Second selection: adding div after :: restricts the result to the div ancestor.
Third selection: the attribute axis retrieves all attribute values.
Fourth selection: the child axis retrieves all direct children (with a predicate added here).
Fifth selection: the descendant axis retrieves all descendant nodes.
Sixth selection: the following axis retrieves all nodes that come after the current node in the document.
Seventh selection: the following-sibling axis retrieves all siblings that come after the current node.
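The axes above can also be tried on a small inline document (made up here so the snippet runs on its own):

```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)
# ancestor:: walks upward; restricting to div keeps only the div ancestor.
print([e.tag for e in html.xpath('//li[1]/ancestor::*')])    # ['html', 'body', 'div', 'ul']
print([e.tag for e in html.xpath('//li[1]/ancestor::div')])  # ['div']
# attribute:: collects the attribute values of the context node.
print(html.xpath('//li[1]/attribute::*'))                    # ['item-0']
# descendant:: reaches the nested span.
print([e.tag for e in html.xpath('//li[1]/descendant::span')])       # ['span']
# following-sibling:: selects later siblings at the same level.
print([e.tag for e in html.xpath('//li[1]/following-sibling::*')])   # ['li']
```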

2. Using Beautiful Soup

Beautiful Soup is an HTML/XML parsing library for Python; with it you can conveniently extract data from elements of a web page.

Basic usage:

# beautiful soup test
from bs4 import BeautifulSoup
html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters;and their names were
            <a href="http://www.baidu.com/01" class="sister" id="link1"><!--Elsie--></a>
            <a href="http://www.baidu.com/02" class="sister" id="link2">Lacie</a> and
            <a href="http://www.baidu.com/03" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Result:

We can see that it automatically closes the tags we left unclosed and prints the document in a standard indented form (this repair actually happens when the BeautifulSoup object is created). Calling soup.title.string (the node selector) then outputs the text content of the title node in the HTML.
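The node selector works the same way for other tags; a minimal sketch (assuming bs4 and lxml are installed; the shortened document here is made up for illustration):

```python
from bs4 import BeautifulSoup

# The body, p, and b tags are deliberately left unclosed.
html_doc = "<html><head><title>The Dormouse's story</title></head><body><p class='title'><b>The Dormouse's story</b>"
soup = BeautifulSoup(html_doc, 'lxml')
# Unclosed tags are repaired when the soup object is built.
print(soup.title.string)  # The Dormouse's story
print(soup.p['class'])    # ['title']  (class is multi-valued, so a list is returned)
print(soup.b.string)      # The Dormouse's story
```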