
Scraping an emoji pack site with Python: single-process / multi-process
Stuck at home with nothing to do, I played around with scraping an emoji pack site. I originally planned a single-process crawler, but it turned out a bit slow, so I eventually switched to multi-process. The single-process code is below; the multi-process code is in the attachment, grab it if you need it. Please go easy on the site when crawling. A test screenshot follows the code.
```python
# -*- coding: utf-8 -*-
import os
import re

import requests
from urllib.request import urlretrieve

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}


def get_urls(n):
    """Fetch page n of the hot list and return (image_url, title) tuples."""
    url = 'https://www.fabiaoqing.com/bqb/lists/type/hot/page/{}.html'.format(n)
    response = requests.get(url, headers=headers)
    # round-trip through GBK with 'ignore' to drop characters that break printing
    html = response.text.encode('gbk', 'ignore').decode('gbk', 'ignore')
    img_urls = re.findall('data-original="(.*?)" alt', html)
    titles = re.findall('" alt="(.*?)" style="', html)
    urls = [(img_urls[i], titles[i]) for i in range(len(titles))]
    print(urls)
    return urls


def download(data):
    """Download every (image_url, title) pair into an Emoji/ folder next to the script."""
    folder = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'Emoji')
    if not os.path.exists(folder):
        os.makedirs(folder)
        print('-- folder created --')
    for img_url, title in data:
        ext = img_url[-4:]              # e.g. '.jpg' or '.gif'
        name = title.replace(' ', '')   # strip spaces from the title
        try:
            path = os.path.join(folder, name + ext)
            print(path)
            print(img_url)
            urlretrieve(img_url, path)
            print('downloaded')
        except Exception as err:
            print(err)


if __name__ == '__main__':
    key = input('Number of pages to crawl: ')
    for n in range(1, int(key) + 1):
        download(get_urls(n))
```
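The multi-process version is only provided in the attachment and is not reproduced here. As a rough idea of what it could look like, here is a minimal sketch of my own (not the attached code) that reuses `get_urls()` and `download()` from the script above and hands one page to each worker via `multiprocessing.Pool`, replacing the `__main__` block:

```python
# Hypothetical multi-process variant (a sketch, not the attached code):
# reuse get_urls() and download() from the script above and let a small
# pool of worker processes each handle one page.
from multiprocessing import Pool


def crawl_page(n):
    # fetch the image list for page n and download every entry
    download(get_urls(n))


if __name__ == '__main__':
    pages = int(input('Number of pages to crawl: '))
    # 4 workers is an arbitrary choice; tune to taste and go easy on the site
    with Pool(processes=4) as pool:
        pool.map(crawl_page, range(1, pages + 1))
```

Because each page is independent, `Pool.map` parallelizes it cleanly; the main caveat (especially on Windows) is that `crawl_page` must be defined at module level so the worker processes can import it.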