[Web Scraping in Practice] Asynchronous Crawlers

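The second edition of Cui Qingcai's web scraping book has just come out, and I rushed to buy a signed copy to pay my respects. Over the next while I will work through the techniques it covers and fill in the gaps in my own knowledge.

For basic coroutine usage, see [python] Coroutines.

First, a link: http://www.httpbin.org/delay/5 waits 5 seconds before responding to a request.

We will use this link to simulate slow responses in a crawler.

Next, we need aiohttp, a library that supports asynchronous requests, to pair with asyncio and build our asynchronous crawler.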

pip install aiohttp

Now let's put it into practice:

import asyncio
import time

import aiohttp


async def get(url):
    # Open a session, send the request, and read the body before the session closes.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            await response.text()
            return response


async def request():
    url = 'http://www.httpbin.org/delay/5'
    print('waiting for', url)
    response = await get(url)
    print('get response from', url, 'response', response)


async def main():
    # Schedule five requests concurrently and wait for all of them to finish.
    await asyncio.gather(*(request() for _ in range(5)))


start = time.time()
asyncio.run(main())
print(time.time() - start)

As we can see, the total time is only about 5.48 seconds, which shows that the five requests did not run one after another.

waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
waiting for http://www.httpbin.org/delay/5
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
get response from http://www.httpbin.org/delay/5 response <ClientResponse(http://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Mon, 06 Dec 2021 06:21:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
5.479228973388672
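
To see where the saving comes from without touching the network, here is a minimal sketch that swaps the HTTP request for asyncio.sleep. The names fake_request, run_sequentially, and run_concurrently, and the 0.2-second delay, are assumptions made for illustration and are not part of the script above.

```python
import asyncio
import time

DELAY = 0.2  # assumed stand-in for the 5-second server delay


async def fake_request(i):
    # asyncio.sleep suspends this coroutine without blocking the event loop,
    # much like waiting for a slow HTTP response.
    await asyncio.sleep(DELAY)
    return i


async def run_sequentially(n):
    # Each await finishes before the next request starts, so the waits add up.
    start = time.perf_counter()
    results = [await fake_request(i) for i in range(n)]
    return results, time.perf_counter() - start


async def run_concurrently(n):
    # gather schedules all coroutines at once, so the waits overlap.
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(i) for i in range(n)))
    return list(results), time.perf_counter() - start


async def main():
    seq_results, seq_time = await run_sequentially(5)
    con_results, con_time = await run_concurrently(5)
    print(f'sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s')
    return seq_results, seq_time, con_results, con_time


if __name__ == '__main__':
    asyncio.run(main())
```

The sequential version should take roughly five times DELAY, while the concurrent version should take roughly DELAY once, which mirrors the 5-second total seen with the real requests.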

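Here we use await, followed by a call to get. When a running coroutine reaches an await on an operation that has not finished yet, it is suspended and the event loop switches to another coroutine, runs it until it too is suspended or finishes, and then moves on to the next one.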

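await can only be followed by an awaitable: a coroutine, a Task or Future, or any object that defines an __await__ method. An ordinary function that returns a plain value cannot be awaited, because asynchronous code is driven by the event loop rather than simply run to completion.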
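
As a small illustration of that rule, the sketch below defines a custom awaitable and also tries to await a plain value. The class name Delay and the function plain_function are hypothetical names invented for this example.

```python
import asyncio


class Delay:
    """A hypothetical awaitable that waits, then produces a value."""

    def __init__(self, seconds, value):
        self.seconds = seconds
        self.value = value

    def __await__(self):
        # __await__ must return an iterator; delegating to a coroutine's
        # own __await__ is one simple way to build it.
        return self._wait().__await__()

    async def _wait(self):
        await asyncio.sleep(self.seconds)
        return self.value


def plain_function():
    return 'not awaitable'


async def main():
    result = await Delay(0.01, 'done')  # works: Delay defines __await__
    error = None
    try:
        await plain_function()  # a str has no __await__, so this fails
    except TypeError as exc:
        error = str(exc)
    return result, error


if __name__ == '__main__':
    print(asyncio.run(main()))
```

Awaiting the Delay object succeeds, while awaiting the string returned by plain_function raises a TypeError.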

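This is also why both session.get(url) and response.text() have to be awaited. So how does the program actually run? When a task first reaches await get(url), control passes straight into get(). Nothing in get() needs to wait yet, so execution continues without a pause: the ClientSession is created and session.get is called. At that point the task is suspended, because the request has to wait on the network and the response takes about five seconds to arrive. The event loop then looks for another task that is ready to run and starts the second one, and so on. Once the last task has been suspended as well, the loop can only wait. When the responses come back, the tasks are woken up, continue running, and print their results.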
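
The order of events described above can be traced with a short sketch. It reuses the names get and request from the script, but replaces the network call with asyncio.sleep, so the 0.05-second delay and the events list are assumptions for illustration only.

```python
import asyncio

events = []  # records the order in which things happen


async def get(name):
    events.append(f'{name}: send request')
    await asyncio.sleep(0.05)  # assumed stand-in for waiting on the network
    events.append(f'{name}: got response')


async def request(name):
    events.append(f'{name}: start')
    await get(name)  # runs straight into get(); suspends only at the sleep
    events.append(f'{name}: done')


async def main():
    events.clear()
    await asyncio.gather(*(request(f'task{i}') for i in range(3)))
    return list(events)


if __name__ == '__main__':
    for line in asyncio.run(main()):
        print(line)
```

Every task logs its start and send request lines before any task logs got response, which matches the five "waiting for" lines appearing together in the real output.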

We can also set a timeout with aiohttp.ClientTimeout, cap the maximum number of concurrent requests with asyncio.Semaphore, and so on.
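
The sketch below shows the concurrency cap on its own, using only the standard library. The fetch function, the state dictionary, and the limit of 2 are assumptions for the demo, and the sleep stands in for a real aiohttp request. In a real crawler the timeout side would typically be supplied as aiohttp.ClientTimeout(total=...) when creating the ClientSession, which is not exercised here because it needs a live request.

```python
import asyncio

CONCURRENCY = 2  # assumed limit, chosen only for the demo


async def fetch(i, semaphore, state):
    # At most CONCURRENCY coroutines can be inside this block at once.
    async with semaphore:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.05)  # stand-in for an aiohttp request
        state['active'] -= 1
    return i


async def main():
    semaphore = asyncio.Semaphore(CONCURRENCY)
    state = {'active': 0, 'peak': 0}
    results = await asyncio.gather(
        *(fetch(i, semaphore, state) for i in range(6))
    )
    return list(results), state['peak']


if __name__ == '__main__':
    results, peak = asyncio.run(main())
    print('results:', results, 'peak concurrency:', peak)
```

Six tasks are scheduled, but the recorded peak never exceeds the semaphore's limit; the remaining tasks wait at async with until a slot is released.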

Practice target: https://spa5.scrape.center/

Posted by admin on 2021-12-06