
Build your first crawler: a simple Baidu Images scraper

admin 2025-01-09 55

Hi everyone, I'm Runsen.

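What is a crawler: the crawler protocol

The Robots protocol (also called the crawler protocol or the robot protocol) is formally known as the "Robots Exclusion Standard" (Robots Exclusion Protocol). A website uses it to tell search engines which pages may be crawled and which may not.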
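To make the idea concrete, here is a minimal sketch of checking crawl rules with Python's standard urllib.robotparser module. The robots.txt content, the crawler name "my-crawler", and the example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/images/cat.jpg"))
print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/data.json"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the rules before calling can_fetch().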

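Crawling Baidu Images

Goal: crawl images from Baidu and save them to the local computer.

Can we crawl it?

First, is the data public? Can it be downloaded?

As the screenshot shows, Baidu's images can be downloaded freely, which means they can be crawled.

Crawl a single image first

First, what is an image?

Next, where is the image located?

Every image has its own URL. Send a request to that URL with the requests module, then save the response to a file opened in wb+ mode.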

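import requests

# Request the image URL
r = requests.get('')
# Write the response body to a file as raw bytes
with open('', 'wb+') as f:
    f.write(r.content)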
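The saving step can be tried without any network access. The sketch below assumes a hypothetical helper save_image and uses a few stand-in bytes in place of r.content; the file name demo.jpg and the temporary directory are also assumptions made for the example.

```python
import os
import tempfile


def save_image(content, path):
    """Write raw image bytes to path and return the number of bytes written."""
    # Binary mode ("wb") matters: image data is bytes, not text.
    with open(path, "wb") as f:
        return f.write(content)


# Stand-in for r.content: a JPEG file starts with the bytes FF D8 FF.
# This is not a complete, viewable image.
fake_content = b"\xff\xd8\xff\xe0" + b"\x00" * 16

target = os.path.join(tempfile.mkdtemp(), "demo.jpg")
written = save_image(fake_content, target)
print(written, os.path.getsize(target))
```

With a real response, the call would be save_image(r.content, 'some_name.jpg').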

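Batch crawling

Analyzing the website

First, understand JSON

JSON (JavaScript Object Notation) is a text format based on JavaScript object syntax, used to store and exchange data.

JSON string

{"name": "毛利", "age": 18, "feature": ["高", "富", "帅"]}

Python dictionary

{'name': '毛利', 'age': 18, 'feature': ['高', '富', '帅']}

Import Python's json module and call json.loads(s) to convert JSON data into Python data (a dictionary).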
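As a minimal sketch of that conversion, the snippet below parses the JSON string from the example above with json.loads and reads values back by key. The variable names s, person, and back are chosen for illustration.

```python
import json

# The JSON string from the example above (JSON requires double quotes).
s = '{"name": "毛利", "age": 18, "feature": ["高", "富", "帅"]}'

# json.loads turns a JSON string into Python objects (here, a dict).
person = json.loads(s)
print(type(person).__name__)
print(person["name"], person["age"], person["feature"][0])

# json.dumps goes the other way; ensure_ascii=False keeps the Chinese text readable.
back = json.dumps(person, ensure_ascii=False)
print(back)
```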

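Using ajax

The images are loaded through ajax: when I scroll down, more images load automatically because the page sends new requests in the background.

Locating the image URLs

Find the URL of the corresponding ajax request as well.

Build the ajax request URL, convert the returned JSON into a dictionary, and then look up values by key to get each image's URL.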

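import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
# Use the full ajax request URL copied from the browser's Network panel;
# only its query string is shown here.
page_url = ';ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&90='
r = requests.get(page_url, headers=headers).text
# Convert the JSON text into a dict and take the list under 'data'
res = json.loads(r)['data']
for index, i in enumerate(res):
    # Skip entries that carry no image URL
    url = i.get('hoverURL')
    if not url:
        continue
    print(url)
    with open('{}.jpg'.format(index), 'wb+') as f:
        f.write(requests.get(url).content)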
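The key lookup can be tested offline on a small sample. The sketch below assumes the response has a top-level data list whose items carry a hoverURL key, as in the code above. The sample payload, its example.com URLs, the empty entries, and the helper name extract_image_urls are all made up for illustration.

```python
import json

# Hypothetical response text, shaped like the structure the code above relies on.
SAMPLE_RESPONSE = """
{"data": [
  {"hoverURL": "https://example.com/a.jpg"},
  {"hoverURL": ""},
  {"hoverURL": "https://example.com/b.jpg"},
  {}
]}
"""


def extract_image_urls(response_text):
    """Return the non-empty hoverURL values found under the 'data' key."""
    data = json.loads(response_text).get("data", [])
    # item.get() avoids a KeyError for entries that lack the key.
    return [item["hoverURL"] for item in data if item.get("hoverURL")]


print(extract_image_urls(SAMPLE_RESPONSE))
```

Using .get() instead of indexing means one malformed entry does not stop the whole loop.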

Building the JSON URLs to keep crawling images

First, compare the requests sent for different JSON responses.

;ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=60&rn=30&gsm=3c&55=

;ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&90=

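Comparing the two, the parameter that matters is pn: it grows by 30 with each new request (30, then 60), matching rn=30 images per page.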
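The paging pattern can be sketched as follows. BASE_URL is a hypothetical placeholder rather than the real endpoint, and only the word, pn, and rn parameters from the URLs above are included; the function names are assumptions made for the example.

```python
from urllib.parse import urlencode

# Hypothetical placeholder, not the real endpoint.
BASE_URL = "https://example.com/acjson"


def page_offsets(num_pages, page_size=30):
    """Return the pn value for each page: 0, 30, 60, ..."""
    return [page_size * i for i in range(num_pages)]


def build_page_url(pn, word="图片", rn=30):
    """Build a request URL for one page; urlencode percent-encodes the keyword."""
    query = urlencode({"word": word, "pn": pn, "rn": rn})
    return BASE_URL + "?" + query


for pn in page_offsets(3):
    print(build_page_url(pn))
```

Letting urlencode handle the keyword reproduces the %E5%9B%BE%E7%89%87 form seen in the captured URLs without writing it by hand.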

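Finally, wrap the code into functions: one acts as the producer and collects the generated image URLs in a list, and the other acts as the consumer and saves the images from that list.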

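# time: 2019/6/20 17:07
# author: 毛利
import requests
import json
import os


def get_pic_url(num):
    """Producer: collect image URLs from num pages of results."""
    pic_url = []
    headers = {'User-Agent': 'Mozilla/5.0 (; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    for i in range(num):
        # Use the full ajax request URL from the browser's Network panel;
        # only its query string is shown here. pn advances by 30 per page.
        page_url = ';ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&90='.format(30 * i)
        r = requests.get(page_url, headers=headers).text
        res = json.loads(r)['data']
        if res:
            print(res)
            for j in res:
                try:
                    url = j['hoverURL']
                    pic_url.append(url)
                except KeyError:
                    print('This image has no URL')
    print(len(pic_url))
    return pic_url


def down_img(num):
    """Consumer: download every collected URL into D:\图片."""
    pic_url = get_pic_url(num)
    path = r'D:\图片'
    # Create the target folder if it does not exist yet
    if not os.path.exists(path):
        os.mkdir(path)
    for index, i in enumerate(pic_url):
        filename = os.path.join(path, str(index) + '.jpg')
        print(filename)
        with open(filename, 'wb+') as f:
            f.write(requests.get(i).content)


if __name__ == '__main__':
    num = int(input('How many pages to crawl (30 images per page): '))
    down_img(num)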
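The consumer half can be exercised without touching the network by passing the download step in as a function. This is a sketch under assumptions: save_all, fake_fetch, the example.com URLs, and the temporary "pictures" folder are all hypothetical names introduced for the example, not part of the code above.

```python
import os
import tempfile


def save_all(urls, out_dir, fetch):
    """Save fetch(url) for every URL as 0.jpg, 1.jpg, ... inside out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        filename = os.path.join(out_dir, "{}.jpg".format(index))
        with open(filename, "wb") as f:
            f.write(fetch(url))
        saved.append(filename)
    return saved


def fake_fetch(url):
    """Stand-in for a real download: returns the URL itself as bytes."""
    return url.encode("utf-8")


demo_dir = os.path.join(tempfile.mkdtemp(), "pictures")
saved_files = save_all(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    demo_dir,
    fake_fetch,
)
print([os.path.basename(p) for p in saved_files])
```

In a real run, fetch could be a small function that returns requests.get(url).content, which keeps the file-naming logic separate from the network call.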