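The code on this page can be downloaded here.
1. Using XPath
XPath stands for XML Path Language, a language for locating nodes in XML documents.
Common rules:
nodename selects the child nodes with this name
/ selects direct children of the current node
// selects descendants of the current node
. selects the current node
.. selects the parent of the current node
@ selects attributes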
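To see the rules above side by side, here is a minimal sketch using lxml. The two-item list is a hypothetical sample made up for this illustration, not the test.xml used on this page.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
doc = etree.fromstring(
    '<ul id="menu">'
    '<li class="item-0"><a href="link1.html">first</a></li>'
    '<li class="item-1"><a href="link2.html">second</a></li>'
    '</ul>'
)

# nodename: select child nodes by tag name (li children of the root ul).
li_children = doc.xpath('li')

# /  steps to direct children, // steps to descendants at any depth.
direct_links = doc.xpath('/ul/a')   # empty: a is not a direct child of ul
all_links = doc.xpath('//a')

# .  is the current node, .. is its parent.
first_link = all_links[0]
current_tag = first_link.xpath('.')[0].tag
parent_tag = first_link.xpath('..')[0].tag

# @  selects an attribute.
hrefs = doc.xpath('//a/@href')

print(len(li_children), len(direct_links), len(all_links))
print(current_tag, parent_tag, hrefs)
```

Note that xpath() always returns a list, even when only one node matches.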
The XML file:
Code:
from lxml import etree

html = etree.parse('test.xml', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
Result:
As we can see, missing tags are completed automatically.
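The automatic completion can be reproduced without a file. This is a small sketch that feeds a deliberately incomplete fragment (an assumption, not the content of test.xml) to lxml's HTML parser through etree.HTML().

```python
from lxml import etree

# Deliberately incomplete markup: no <html>/<body>, and <li>/<ul> are never closed.
broken = '<ul><li>first item'

root = etree.HTML(broken)                     # parse with lxml's HTML parser
fixed = etree.tostring(root).decode('utf-8')  # serialize the repaired tree
print(fixed)
```

The serialized tree is wrapped in html and body nodes, and the open li and ul tags are closed.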
All nodes:
# xml test -> all
from lxml import etree

html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//*')
print(result)
Result:
[ <Element html at 0x1d19a6d9a48>, <Element body at 0x1d19a6d9b48>, <Element div at 0x1d19a6d9b88>, <Element ul at 0x1d19a6d9bc8>, <Element li at 0x1d19a6d9c08>, <Element a at 0x1d19a6d9c88>, <Element li at 0x1d19a6d9cc8>, <Element a at 0x1d19a6d9d08>, <Element li at 0x1d19a6d9d48>, <Element a at 0x1d19a6d9c48>, <Element li at 0x1d19a6d9d88>, <Element a at 0x1d19a6d9dc8>, <Element li at 0x1d19a6d9e08>, <Element a at 0x1d19a6d9e48> ]
Child nodes:
# xml test -> son
from lxml import etree

html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
Result:
[ <Element a at 0x26e0af09b48>, <Element a at 0x26e0af09b88>, <Element a at 0x26e0af09bc8>, <Element a at 0x26e0af09c08>, <Element a at 0x26e0af09c48> ]
This outputs every a node that is a direct child of an li node.
Parent node:
# xml test -> father
from lxml import etree

html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
Output:
['item-0']
Getting attributes:
# xml test -> get attribute
from lxml import etree

html = etree.parse('test.xml', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
Output:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
Matching an attribute that has multiple values:
# xml test -> match an attribute that has more than one value
from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
Output:
['first item']
Matching on multiple attributes:
# xml test -> match a node by more than one attribute
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
Output:
['first item']
Selecting by position:
# xml test -> output by order
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')  # the first node
print(result)
result = html.xpath('//li[last()]/a/text()')  # the last node
print(result)
result = html.xpath('//li[position()<3]/a/text()')  # nodes whose position is less than 3
print(result)
result = html.xpath('//li[last()-2]/a/text()')  # the third-to-last node
print(result)
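Because the sample text above contains a single li, the four predicates are hard to tell apart. The sketch below runs the same expressions against a hypothetical five-item list (the markup is an assumption made up for this illustration) so that each one selects something different.

```python
from lxml import etree

# Hypothetical five-item list, used only for this illustration.
text = '''
<ul>
  <li><a href="link1.html">first item</a></li>
  <li><a href="link2.html">second item</a></li>
  <li><a href="link3.html">third item</a></li>
  <li><a href="link4.html">fourth item</a></li>
  <li><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(text)

first = html.xpath('//li[1]/a/text()')                 # XPath positions start at 1
last = html.xpath('//li[last()]/a/text()')
first_two = html.xpath('//li[position()<3]/a/text()')  # position() is a function call
third_from_last = html.xpath('//li[last()-2]/a/text()')

print(first)            # ['first item']
print(last)             # ['fifth item']
print(first_two)        # ['first item', 'second item']
print(third_from_last)  # ['third item']
```

Note that position() needs the parentheses; a bare position would be read as a child node name.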
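Node axes:
# xml test -> node axes
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)
First selection: the ancestor axis returns all ancestor nodes.
Second selection: adding div after the axis keeps only the ancestors that are div nodes.
Third selection: the attribute axis returns all attribute values.
Fourth selection: the child axis returns direct children (restricted here to a nodes whose href is link1.html).
Fifth selection: the descendant axis returns descendants (restricted here to span nodes).
Sixth selection: the following axis returns all nodes after the current node; the [2] index keeps only the second of them.
Seventh selection: the following-sibling axis returns all sibling nodes after the current node.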
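The seven selections can be checked against a slightly larger document. This is a sketch under the assumption of a hypothetical div > ul > li structure with a span inside the first link; tags() is a small helper introduced here only to print tag names instead of element objects.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
text = '''
<div>
  <ul>
    <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-2"><a href="link3.html">third item</a></li>
  </ul>
</div>
'''
html = etree.HTML(text)


def tags(nodes):
    """Hypothetical helper: reduce a list of elements to their tag names."""
    return [node.tag for node in nodes]


ancestors = tags(html.xpath('//li[1]/ancestor::*'))        # html, body, div, ul
div_only = tags(html.xpath('//li[1]/ancestor::div'))       # only the div ancestor
attributes = html.xpath('//li[1]/attribute::*')            # attribute values of the li
children = tags(html.xpath('//li[1]/child::a[@href="link1.html"]'))
spans = tags(html.xpath('//li[1]/descendant::span'))
second_following = tags(html.xpath('//li[1]/following::*[2]'))
siblings = tags(html.xpath('//li[1]/following-sibling::*'))

for name, value in [('ancestors', ancestors), ('div_only', div_only),
                    ('attributes', attributes), ('children', children),
                    ('spans', spans), ('second_following', second_following),
                    ('siblings', siblings)]:
    print(name, value)
```

The html and body ancestors appear because etree.HTML() wraps the fragment in a full document.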
2. Using Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from the elements of a web page.
Basic usage:
# beautiful soup test
from bs4 import BeautifulSoup

html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://www.baidu.com/01" class="sister" id="link1"><!--Elsie--></a>
<a href="http://www.baidu.com/02" class="sister" id="link2">Lacie</a>
and
<a href="http://www.baidu.com/03" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
Result:
As we can see, Beautiful Soup automatically closes the tags we left open and prints the document with standard indentation (the repair already happens when the BeautifulSoup object is created; prettify() only formats the output). Calling soup.title.string, a node selector, then prints the text content of the title node.
Node selectors:
Select an element: soup.title.string
Get attributes: soup.p.attrs, soup.p.attrs['name']
Get content: soup.p.string
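The three node selectors above can be tried on a shortened version of the sample document. This sketch assumes Python's built-in 'html.parser' instead of 'lxml', only so that it runs without an extra parser installed.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, used only for this illustration.
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time</p>
</body></html>
'''
# Assumption: the built-in parser is used here instead of 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

title_text = soup.title.string        # text of the title node
first_p_attrs = soup.p.attrs          # all attributes of the first p, as a dict
first_p_name = soup.p.attrs['name']   # a single attribute
first_p_text = soup.p.string          # text of the first p (through its only child, b)

print(title_text)
print(first_p_attrs)
print(first_p_name)
print(first_p_text)
```

Two details are worth noticing: soup.p only reaches the first p node, and class comes back as a list because it can hold several values.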
Method selectors:
(1) find_all():
Finds all elements that match the given conditions. Pass it attributes or text and it returns the matching elements.
find_all(name, attrs, recursive, text, **kwargs)
# beautiful soup test -> find_all
import re

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))  # by node name
print()
for ul in soup.find_all(name='ul'):  # nested query on each result
    print(ul.find_all(name='li'))
print()
print(soup.find_all(attrs={'id': 'list-1'}))  # by attribute dict
print()
print(soup.find_all(id='list-1'))  # by keyword argument
print()
print(soup.find_all(text=re.compile('Foo')))  # by text, using a regular expression
Result:
(2) find():
Works much like find_all(), except that find_all() returns a list of all matching elements, while find() returns a single element: the first match.
# beautiful soup test -> find
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
Result (only the first match is returned):
CSS selectors:
Use select() to select nodes.
Use [] or attrs[] to get attributes.
Use .get_text() or .string to get text.
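The two ways of reading text are not interchangeable, which a short sketch makes visible. The markup below is a hypothetical sample, and the built-in 'html.parser' is assumed instead of 'lxml' so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second li mixes plain text with a nested tag.
html = '''
<ul class="list" id="list-1">
  <li class="element"><a href="a.html">Foo</a></li>
  <li class="element">Bar <b>bold</b></li>
</ul>
'''
# Assumption: the built-in parser is used here instead of 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

items = soup.select('#list-1 .element')             # CSS selector, returns a list
classes = [li['class'] for li in items]             # attribute via []
same_classes = [li.attrs['class'] for li in items]  # attribute via attrs[]
texts = [li.get_text() for li in items]             # all text inside each node
strings = [li.string for li in items]               # None when a node has several children

print(classes)
print(texts)    # ['Foo', 'Bar bold']
print(strings)  # ['Foo', None]
```

get_text() concatenates every piece of text under the node, whereas .string is only defined when the node has exactly one child.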
Code:
# beautiful soup test -> css selectors
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print()
print(soup.select('ul li'))
print()
print(soup.select('#list-2 .element'))
print()
for li in soup.select('li'):
    # attribute
    print(li['class'])
    print(li.attrs['class'])
    print('Get Text:', li.get_text())
    print('String:', li.string)
Result:
3. Using pyquery
Besides an HTML string, a PyQuery object can also be initialized directly from a URL or from a file.
# pyquery test
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)  # from a string
print(doc('li'))
print()
doc = pq(url='http://www.sniper97.cn')  # from a URL
print(doc('title'))
print()
doc = pq(filename='test.xml')  # from a file
print(doc('li'))
print()
Result:
Basic CSS selectors:
# pyquery test -> css
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body" id="list-0">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
print(doc('#list-0 .list li'))
Here, #list-0 .list li first selects the node whose id is list-0, then the nodes with class list inside it, and finally the li nodes inside those.
Output:
Finding nodes:
Descendant nodes: find()
Parent nodes: parent() for the direct parent, parents() for all ancestors
Sibling nodes: siblings()
# pyquery test -> find node
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body" id="list-0">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
item = doc('ul')
print(item)
print()
lis = item.find('li')  # descendant nodes
print(lis)
print()
par = item.parents()  # all ancestor nodes
print(par)
print()
par = item.parents('.panel-body')  # only the ancestors matching the selector
print(par)
print()
node = doc('li')
print(node.siblings('.element'))  # sibling nodes
Output:
Getting information:
Getting attributes:
When a selection contains several nodes, attr() returns the attribute of the first node only, so iterate over items() to read each node. .text() returns only the text, while .html() returns the inner HTML.
# pyquery test -> get information
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body" id="list-0">
        <ul class="list" id="list-1" name="elements">
            <li class="element1"><a href="www.123.com">Foo</li>
            <li class="element2"><a href="www.123.com">Bar</li>
            <li class="element3"><a href="www.123.com">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element4"><a href="www.123.com">Foo</li>
            <li class="element5"><a href="www.123.com">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
item = doc('ul')
print(item.attr('class'))
print(item.attr.id)
item = doc('li')
print(item.attr('class'))  # only the first node's attribute is returned
for i in item.items():
    print(i.attr('class'))  # attribute of each node
    print(i.text())  # text of each node
    print(i.html())  # inner HTML of each node
Output:
Node manipulation:
pyquery provides methods for modifying nodes dynamically, such as adding a class to a node or removing a node.
addClass and removeClass (written add_class and remove_class in the code below)
# pyquery test -> node handle
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body" id="list-0">
        <ul class="list" id="list-1" name="elements">
            <li class="element1"><a href="www.123.com">Foo</li>
            <li class="element2"><a href="www.123.com">Bar</li>
            <li class="element3"><a href="www.123.com">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element4"><a href="www.123.com">Foo</li>
            <li class="element5"><a href="www.123.com">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
ul = doc('ul')
print(ul)
print()
ul.add_class('action')
print(ul)
print()
ul.remove_class('action')
print(ul)
print()
Output:
attr, text, html
Besides the class attribute, attr() can set any attribute, and text() and html() can replace a node's text and inner HTML.
# pyquery test -> attribute handle
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body" id="list-0">
        <ul class="list" id="list-1" name="elements">
            <li class="element1"><a href="www.123.com">Foo</li>
            <li class="element2"><a href="www.123.com">Bar</li>
            <li class="element3"><a href="www.123.com">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element4"><a href="www.123.com">Foo</li>
            <li class="element5"><a href="www.123.com">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
ul = doc('ul')
print(ul)
print()
ul.attr('name', 'sniper')  # set (or add) the name attribute
print(ul)
print()
ul.text('change item')  # replace the content with plain text
print(ul)
print()
ul.html('<a href="www.123.com">')  # replace the content with HTML
print(ul)
print()
Output: the name attribute is added, the text is replaced, and the HTML is replaced (the child nodes are replaced as well):
remove():
Removes a node: .find('xxx').remove()
Pseudo-class selectors:
These can select the first node, the last node, odd or even nodes, nodes containing a given text, and so on.
# pyquery test -> pseudo-class selectors
from pyquery import PyQuery as pq

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body" id="list-0">
        <ul class="list" id="list-1" name="elements">
            <li class="element1"><a href="www.123.com">Foo</li>
            <li class="element2"><a href="www.123.com">Bar</li>
            <li class="element3"><a href="www.123.com">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element4"><a href="www.123.com">Foo</li>
            <li class="element5"><a href="www.123.com">Bar</li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('li:first-child')  # first child
print(li)
print()
li = doc('li:last-child')  # last child
print(li)
print()
li = doc('li:nth-child(2)')  # second child
print(li)
print()
li = doc('li:gt(2)')  # nodes whose index is greater than 2
print(li)
print()
li = doc('li:nth-child(2n)')  # even-numbered children
print(li)
print()
li = doc('li:contains(Bar)')  # nodes containing the text "Bar"
print(li)
print()
Output: