Collecting WeChat official-account data with Python's Scrapy framework: a clever trick to get past the anti-crawler captcha

Background

This code uses the Scrapy framework to crawl Sogou's regular web-search results for a given keyword and saves them to a CSV file in the same directory. It does not actually crawl WeChat official-account articles, but the principle for getting around the captcha is the same. If there are mistakes, corrections are welcome.


URL structure

https://www.sogou.com/web?query={keyword}&page={n}
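
If you want to generate these URLs in code, it is just query-string formatting. A minimal sketch, with the parameter names taken from the URL above:

from urllib.parse import urlencode

def build_search_url(keyword, page):
    # Build the Sogou web-search URL for one keyword and one result page
    return 'https://www.sogou.com/web?' + urlencode({'query': keyword, 'page': page})

# build_search_url('python', 1)
# -> 'https://www.sogou.com/web?query=python&page=1'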

Starting the crawl

I won't rehash the routine Scrapy setup; straight to the code:

class SougouSearchSpider(scrapy.Spider):
    name = 'sogou_search'
    allowed_domains = ['www.sogou.com']
    start_urls = ['https://www.sogou.com/web?query=python&page=1']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url,
                                 callback=self.parse)

After kicking it off, the crawl stopped after only 3 pages, with the following error:

2020-06-11 16:05:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sogou.com/antispider/?from=%2Fweb%3Fquery%3Dpython%26page%3D7%26ie%3Dutf8&suuid=6969bcac-7309-410c-b2f0-67d2e882e8a5&m=1> from <GET https://www.sogou.com/web?query=python&page=4&ie=utf8>

The fourth page, which should have come next, was 302-redirected to the antispider captcha page.

Seeing this, let's try the usual countermeasures:
1. Rotate dynamic IPs
2. Fake the headers
3. Brute-force the captcha

One step at a time. First, the IP. I'm using ProxyPool, a free Redis-backed proxy IP pool (GitHub project). Very nice, and the code is concise.

import requests

# Local ProxyPool service that returns one random proxy per call
proxypool_url = 'http://127.0.0.1:5555/random'

def get_random_proxy():
    try:
        response = requests.get(proxypool_url)
        if response.status_code == 200:
            return response.text.strip()
    except requests.exceptions.ConnectionError:
        return None
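
Inside Scrapy itself, the proxy returned by the pool is attached to each request through request.meta. A minimal sketch (the full spider at the end does the same thing):

import scrapy

def make_request(url, callback):
    # Attach a fresh proxy from the local pool to a single request
    proxy = 'http://' + get_random_proxy()
    return scrapy.Request(url=url, callback=callback, meta={'proxy': proxy})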

Unfortunately, rotating the IP alone doesn't cut it; requests still get 302'd to the antispider page.
Next, let's fake the headers. First, look at the request headers: do a manual search in the browser and inspect it with F12.

The User-Agent can be randomly generated with fake_useragent (GitHub project). The hard part is the cookies:

ssuid=6821261448; IPLOC=CN3301; 
SUID=6E4A782A1810990A000000005D1B0530; 
SUV=001E174F2A784A5C5D356F6A8DE34468; 
wuid=AAGgaAPZKAAAAAqLFD1NdAgAGwY=; 
CXID=76D40644BC123D9C02A653E240879D21; 
ABTEST=1|1590639095|v17; browerV=3; osV=1; pgv_pvi=4077461504; 
ad=TWMV9lllll2WYZuMlllllVEYsfUlllllKgo1fZllll9lllllxllll5@@@@@@@@@@; 
pgv_si=s7053197312; clientId=1B608ECCC1D96A9828CC801DC0116B61; 
SNUID=8BAB9BC9E2E6452717FC02C7E3A7749E; 
ld=VZllllllll2W88XflllllVEPcwylllll3XPwWlllllUllllljZlll5@@@@@@@@@@; 
sst0=711; LSTMV=190%2C728; LCLKINT=2808800; sct=36

After visiting a few more pages and watching how the values change, it turns out only two parameters really matter:

SUV=001E174F2A784A5C5D356F6A8DE34468;
SNUID=8BAB9BC9E2E6452717FC02C7E3A7749E;
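
To test this, the two values can be copied from the browser and sent as the only cookies on each request. A sketch of the spider's start_requests, using the values above as placeholders:

import scrapy

class SougouSearchSpider(scrapy.Spider):
    name = 'sogou_search'
    start_urls = ['https://www.sogou.com/web?query=python&page=1']

    def start_requests(self):
        # Send only the two cookies that seem to matter (values copied from the browser)
        cookies = {'SUV': '001E174F2A784A5C5D356F6A8DE34468',
                   'SNUID': '8BAB9BC9E2E6452717FC02C7E3A7749E'}
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, cookies=cookies)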

So I sent requests carrying just these two cookies. That bought me another 2 pages, but I quickly hit antispider again.

Then I wondered: what if I swap in a fresh set of cookies for every single page?

The clever trick

It suddenly hit me that Sogou is one big family: visiting another Sogou property, Sogou Video for example, should also hand out cookies.

import requests
from fake_useragent import UserAgent

# Random User-Agent reused for every cookie-fetch request
ua = UserAgent().random

def get_new_cookies():
    # Sogou Video URL: any Sogou property will hand out a fresh cookie set
    url = 'https://v.sogou.com/v?ie=utf8&query=&p=40030600'
    proxies = {"http": "http://" + get_random_proxy()}
    headers = {'User-Agent': ua}
    rst = requests.get(url=url,
                       headers=headers,
                       allow_redirects=False,
                       proxies=proxies)
    cookies = rst.cookies.get_dict()
    return cookies

Sure enough, it returned exactly the cookies I wanted:

{'IPLOC': 'CN3301', 
'SNUID': '13320250797FDDB3968937127A18F658', 
'SUV': '00390FEA2A7848695EE1EE6630A8C887', 
'JSESSIONID': 'aaakft_AlKG3p7g_ZIIkx'}

I printed it a few more times and the cookie values were different every time. That's all we need; time to wire it into Scrapy.

Mission accomplished

I tried crawling one page every 3 seconds and ran it for two hours without ever triggering antispider!

That covers my needs, so I never bothered brute-forcing the captcha. Of course, hooking up a third-party captcha-solving service is easy enough; I won't go into it.
If you find that too slow, you can shorten the interval.
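
The pacing in the spider below is a plain time.sleep inside parse. If you prefer, Scrapy's built-in throttling settings do the same job; this is a sketch using standard Scrapy settings, not something from the original project:

# settings.py: let Scrapy space out requests instead of sleeping in parse
DOWNLOAD_DELAY = 3                 # roughly 3 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True    # jitter each delay between 0.5x and 1.5x
CONCURRENT_REQUESTS = 1            # keep requests strictly sequential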

Full code

sougou_search.py

# coding=utf-8
from sougou_search_spider.items import SougouSearchSpiderItem
from IP.free_ip import get_random_proxy
from IP.get_cookies import get_new_cookies, get_new_headers
import scrapy
import time
import random


class SougouSearchSpider(scrapy.Spider):
    name = 'sogou_search'
    allowed_domains = ['www.sogou.com']
    start_urls = ['https://www.sogou.com/web?query=python']

    def start_requests(self):
        headers = get_new_headers()
        for url in self.start_urls:
            # Get a proxy IP from the pool
            proxy = 'http://' + str(get_random_proxy())
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 headers=headers,
                                 meta={'proxy': proxy})  # HttpProxyMiddleware reads meta['proxy']

    def parse(self, response):
        headers_new = get_new_headers()
        cookies_new = get_new_cookies()
        # Current page number shown in the pagination bar
        current_page = int(response.xpath('//div[@id="pagebar_container"]/span/text()').extract_first())
        # Parse each search result on the current page
        for i, a in enumerate(response.xpath('//div[contains(@class,"vrwrap")]/h3[@class="vrTitle"]/a')):
            # Extract the title, stripping spaces and newlines
            title = ''.join(a.xpath('./em/text() | ./text()').extract()).replace(' ', '').replace('\n', '')
            if title:
                item = SougouSearchSpiderItem()
                # Record the link (direct links vs. Sogou /link redirect links), page number, rank and title
                if a.xpath('@href').extract_first().startswith('/link'):
                    item['visit_url'] = 'https://www.sogou.com' + a.xpath('@href').extract_first()  # Sogou redirect link
                else:
                    item['visit_url'] = a.xpath('@href').extract_first()
                item['page'] = current_page
                item['rank'] = i + 1
                item['title'] = title
                yield item
        # Throttle the crawl rate
        time.sleep(random.randint(8, 10))
        # Grab the link to the next results page
        p = response.xpath('//div[@id="pagebar_container"]/a[@id="sogou_next"]')
        if p:
            p_url = 'https://www.sogou.com/web' + str(p.xpath('@href').extract_first())
            proxy = 'http://' + str(get_random_proxy())
            yield scrapy.Request(url=p_url,
                                 callback=self.parse,
                                 headers=headers_new,
                                 cookies=cookies_new,
                                 meta={'proxy': proxy})

get_cookies.py

# coding=utf-8
from IP.free_ip import get_random_proxy
from fake_useragent import UserAgent
import requests

ua = UserAgent().random

def get_new_headers():
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
              "Accept-Encoding": "gzip, deflate, br",
              "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
              "User-Agent": ua}
    return headers

def get_new_cookies():
    # Grab a fresh cookie set from Sogou Video; any Sogou property works
    url = 'https://v.sogou.com/v?ie=utf8&query=&p=40030600'
    proxies = {"http": "http://" + get_random_proxy()}
    headers = {'User-Agent': ua}
    rst = requests.get(url=url,
                       headers=headers,
                       allow_redirects=False,
                       proxies=proxies)
    cookies = rst.cookies.get_dict()
    return cookies

if __name__ == '__main__':
    print(get_new_cookies())

free_ip.py

# coding=utf-8
import requests

proxypool_url = 'http://127.0.0.1:5555/random'

def get_random_proxy():
    try:
        response = requests.get(proxypool_url)
        if response.status_code == 200:
            return response.text.strip()
    except requests.exceptions.ConnectionError:
        return None

if __name__ == '__main__':
    print(get_random_proxy())

settings.py

BOT_NAME = 'sougou_search_spider'

SPIDER_MODULES = ['sougou_search_spider.spiders']
NEWSPIDER_MODULE = 'sougou_search_spider.spiders'

# Don't follow redirects (the antispider captcha sits behind a 302),
# and let 302 responses reach the spider instead of being dropped as errors
REDIRECT_ENABLED = False
HTTPERROR_ALLOWED_CODES = [302]

ROBOTSTXT_OBEY = False
COOKIES_ENABLED = True

ITEM_PIPELINES = {
    'sougou_search_spider.pipelines.CsvSougouSearchSpiderPipeline': 300,
}
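
The spider and the pipeline setting reference SougouSearchSpiderItem and CsvSougouSearchSpiderPipeline, which the post does not list. Here is a minimal sketch of what they might look like; the field names come from the spider above, while the file name and CSV layout are assumptions:

# items.py
import scrapy

class SougouSearchSpiderItem(scrapy.Item):
    # Fields filled in by the spider's parse() method
    visit_url = scrapy.Field()
    page = scrapy.Field()
    rank = scrapy.Field()
    title = scrapy.Field()

# pipelines.py
import csv

class CsvSougouSearchSpiderPipeline:
    # Append each item as one row of a CSV file (file name is an assumption)
    def open_spider(self, spider):
        self.file = open('sogou_search_results.csv', 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['page'], item['rank'], item['title'], item['visit_url']])
        return item

    def close_spider(self, spider):
        self.file.close()

With everything in place, the crawl is started with scrapy crawl sogou_search from the project root.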