The second edition of 崔庆才 (Cui Qingcai)'s hands-on web-scraping book is out; I grabbed a signed copy right away, and over the coming weeks I'll be working through the techniques it covers to fill in the gaps in my own knowledge.
For the basics of coroutines, see the earlier post 【python】协程Coroutines.
First, a test URL that delays its response by 5 seconds: http://www.httpbin.org/delay/5
We will use this URL to simulate slow responses in a crawler; for contrast, a synchronous baseline sketch follows.
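Here is roughly what a synchronous version looks like (a sketch assuming the requests library is installed; each call blocks for the full delay, so five requests take about 25 seconds in total):

import time
import requests   # synchronous baseline for comparison

start = time.time()
for _ in range(5):
    requests.get('http://www.httpbin.org/delay/5')   # blocks ~5 s per call
print(time.time() - start)   # roughly 25 s: the requests run one after another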
To do better, we need aiohttp, a library that supports asynchronous HTTP requests, paired with asyncio, to build our asynchronous crawler:
pip install aiohttp
Now let's put it into practice:
import time
import aiohttp
import asyncio
async def get(url):
    # one session per request keeps the demo simple; production code
    # would usually share a single ClientSession across all requests
    session = aiohttp.ClientSession()
    response = await session.get(url)
    await response.text()      # read the body before closing the session
    await session.close()
    return response

async def request():
    url = 'http://www.httpbin.org/delay/5'
    print('waiting for', url)
    response = await get(url)
    print('get response from', url, 'response', response)

start = time.time()
tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
As the output shows, the total time is only about 5.5 seconds rather than the 25-plus a sequential run would need, so the five requests clearly did not run one after another:
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
5.479228973388672
Here we used await in front of the get(url) call. When a running coroutine hits an await, it is suspended and the event loop switches to other coroutines, until those are also suspended or finished, and only then does the suspended one resume.
await can only be followed by an awaitable: a coroutine, or an object with an __await__ method, because such objects, unlike ordinary functions, know how to hand control back to the event loop.
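A tiny sketch of a custom awaitable (the Ready class here is made up purely for illustration; anything defining __await__ can follow await, while a plain value cannot):

import asyncio

class Ready:
    # minimal awaitable: defines __await__, so `await Ready()` is legal
    def __await__(self):
        yield            # hand control back to the event loop once
        return 42

async def main():
    print(await Ready())   # prints 42
    # await 42             # TypeError: 'int' can't be used in 'await' expression

asyncio.run(main())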
That is also why session.get(url) and response.text() each need an await. How does the whole thing run? When the program first reaches await get(url), execution steps straight into get(): the ClientSession is created and session.get(url) is issued. Because the request takes a long time to return, the task is suspended at that await and will not be woken until the response arrives. The event loop then looks for a coroutine that is ready to run and starts the second task, and so on, until the last task is also suspended and everything is waiting. As each response comes back, the corresponding task is woken, resumes where it left off, and prints its result.
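The same scheduling can be seen in a self-contained sketch that replaces the network wait with asyncio.sleep:

import asyncio

async def worker(n):
    print('start', n)
    await asyncio.sleep(1)   # suspension point: the task parks here
    print('end', n)

async def main():
    # all three tasks start before any of them finishes:
    # start 0 / start 1 / start 2, then ~1 s later end 0 / end 1 / end 2
    await asyncio.gather(worker(0), worker(1), worker(2))

asyncio.run(main())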
We can also set a timeout with aiohttp.ClientTimeout and cap the number of concurrent requests with asyncio.Semaphore, as in the sketch below.
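A sketch combining the two (the 10-second total timeout and the cap of 3 concurrent requests are values picked here for illustration):

import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    async with semaphore:                        # at most 3 requests in flight
        async with session.get(url) as response:
            return await response.text()

async def main():
    semaphore = asyncio.Semaphore(3)             # assumed concurrency cap
    timeout = aiohttp.ClientTimeout(total=10)    # whole request must finish in 10 s
    async with aiohttp.ClientSession(timeout=timeout) as session:
        url = 'http://www.httpbin.org/delay/5'
        results = await asyncio.gather(*(fetch(session, semaphore, url) for _ in range(5)))
        print(len(results), 'responses received')

asyncio.run(main())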
Hands-on practice: https://spa5.scrape.center/
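A minimal skeleton for that site, assuming the page loads its data from a JSON Ajax API; the /api/book/ path and the limit/offset parameters below are assumptions to verify in the browser's DevTools before relying on them:

import asyncio
import aiohttp

BASE_URL = 'https://spa5.scrape.center'
# Assumed Ajax endpoint; confirm the real path and parameters in DevTools.
INDEX_URL = BASE_URL + '/api/book/?limit=18&offset={offset}'

async def scrape_index(session, semaphore, page):
    url = INDEX_URL.format(offset=18 * (page - 1))
    async with semaphore:
        async with session.get(url) as response:
            return await response.json()

async def main():
    semaphore = asyncio.Semaphore(5)             # cap concurrency to be polite
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        pages = await asyncio.gather(*(scrape_index(session, semaphore, p)
                                       for p in range(1, 4)))
        for data in pages:
            print(str(data)[:80], '...')

asyncio.run(main())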