meShell

Do-it-all engineer

2023/05/31 21:24

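Python crawler: automatic login to discuz! to farm points

I have been reading forums a lot lately and wanted to raise my forum level, so I decided to write a script that farms points automatically every day. Below we build, from scratch in Python, a script that logs in automatically and visits user spaces automatically. We use https://www.hostloc.com/ as the test subject.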

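Requirements

We need a Python 3 runtime and the Python package manager pip. The whole feature depends on two third-party packages, urllib3 and BeautifulSoup4, plus lxml, which the BeautifulSoup calls later in this article use as the parser.


# if pip is not on your PATH
meshell@python# python3 -m pip install urllib3 BeautifulSoup4 lxml

# if pip is on your PATH
meshell@python# pip install urllib3 BeautifulSoup4 lxml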

Basic definitions

We define a simple class with the attributes username, password, userAgent, host, identity_cookie_name and cookies.
username, password, host and identity_cookie_name are passed in as constructor arguments.


class HostLoc(object):
    username = None  # login username
    password = None  # login password
    userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"  # User-Agent header sent with every request
    cookies = {}  # all cookies collected from responses, kept as a dict
    identity_cookie_name = None  # name of the cookie that marks a successful login
    host = None  # site host, used as the URL prefix

    def __init__(self, username, password, host, identity_cookie_name):
        self.username = username
        self.password = password
        self.host = host
        self.identity_cookie_name = identity_cookie_name
        self.cookies = {}  # per-instance dict, so instances do not share the class-level one
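
One detail of the class above is worth illustrating: a mutable value such as cookies = {} defined at class level is a single dict shared by every instance, unless __init__ binds a fresh one. The sketch below uses two hypothetical classes, SharedJar and OwnJar (they are not part of the script), to show the difference. It only matters once the script handles more than one account in the same process.

```python
class SharedJar(object):
    cookies = {}  # class-level dict: one object shared by all instances


class OwnJar(object):
    def __init__(self):
        self.cookies = {}  # fresh dict for each instance


a, b = SharedJar(), SharedJar()
a.cookies["sid"] = "abc"
shared_leaks = "sid" in b.cookies  # b sees the cookie stored through a

c, d = OwnJar(), OwnJar()
c.cookies["sid"] = "abc"
own_leaks = "sid" in d.cookies  # d has its own, still empty dict

print(shared_leaks, own_leaks)
```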

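Implementing login

Before implementing login, you need to know the basics of urllib3, the library we use to send HTTP requests (see its official documentation). We give the class one shared method for sending HTTP requests, because every request goes to the same site and the cookies and headers are essentially the same each time.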


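import ssl
import urllib3

...
host = None

def __init__(self, username, password, host, identity_cookie_name):
    ...
    # one connection pool shared by all requests, with certificate verification disabled
    self.http = urllib3.PoolManager(cert_reqs=ssl.CERT_NONE, assert_hostname=False)
    ...

def _request(self, method, url, fields=None):
    headers = {
        "origin": self.host,
        "referer": self.referer,
        "User-Agent": self.userAgent,
    }
    # send back every cookie collected so far
    if len(self.cookies) > 0:
        headers['cookie'] = self.joinCookies()

    # keyword arguments keep the call independent of the positional order of request()
    response = self.http.request(method, url, fields=fields, headers=headers)

    # merge any cookies set by the server into the stored ones
    cookies = self.parseCookie(response.headers.get('Set-Cookie'))
    if len(cookies) > 0:
        self.cookies.update(cookies)

    return response
...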

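Note: to turn off HTTPS certificate verification in urllib3, you must set both cert_reqs=ssl.CERT_NONE and assert_hostname=False.

urllib3 takes headers as strings, while our cookies are stored as a dict, so we need a method that converts them into the form of a Cookie header.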


def joinCookies(self):
    cookie_string = ""
    for key, value in self.cookies.items():
        cookie_string += key + "=" + value + ";"
    return cookie_string.rstrip(";")
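
As a standalone illustration of the conversion above, the sketch below applies the same logic to a small dict. The function name join_cookies and the sample cookie names are made up for the example.

```python
def join_cookies(cookies):
    """Turn a dict such as {"a": "1", "b": "2"} into the header value "a=1;b=2"."""
    return ";".join(name + "=" + value for name, value in cookies.items())


sample = {"sid": "abc123", "auth": "xyz"}
header_value = join_cookies(sample)
print(header_value)  # sid=abc123;auth=xyz
```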

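Parsing cookies is the most important step of the whole request flow. After a successful login we must record every cookie the server sends, and send those cookies back to the server when requesting the next page. As the request method above shows, cookies are parsed by a parseCookie method; here is how it is implemented.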


def parseCookie(self, cookie=None):
    cookies = {}
    if cookie is None:
        return cookies
    for value in cookie.split(";"):
        # split on the first "=" only, so values that contain "=" stay intact
        pair = value.split("=", 1)
        if len(pair) < 2 or pair[0].strip(" ") in ["expires", "Max-Age", "path"]:
            continue
        name = pair[0].strip(" ")
        # several Set-Cookie headers arrive joined by ","; keep the part after the comma
        index = name.find(',')
        if index != -1:
            name = name[index+1:].lstrip(" ")

        cookies[name] = pair[1]
    return cookies

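The code above parses only each cookie's name and value; the expiry time, path and similar attributes are not needed.

With the code we already have, login only needs to send the request and check whether the cookie that marks a successful login is present.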
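
The result of the parser above depends on which attribute happens to precede the comma when several Set-Cookie headers are joined into one string. A possible alternative, sketched below under the assumption that cookies in the combined header are separated by a comma followed by a name=value pair, first splits the string into individual cookies and then keeps only the leading name=value of each. The function name parse_set_cookie and the sample header are hypothetical.

```python
import re


def parse_set_cookie(header):
    """Return {name: value} from a comma-joined Set-Cookie header string."""
    cookies = {}
    if not header:
        return cookies
    # Split only at commas followed by "token=", which skips the comma
    # inside dates such as "expires=Thu, 01-Jan-2026 00:00:00 GMT".
    for item in re.split(r",\s*(?=[^;,=\s]+=)", header):
        first = item.split(";", 1)[0]  # drop attributes (path, expires, ...)
        name, sep, value = first.partition("=")
        if sep and name.strip():
            cookies[name.strip()] = value.strip()
    return cookies


sample = ("sid=abc123; expires=Thu, 01-Jan-2026 00:00:00 GMT; path=/, "
          "auth=xyz; Max-Age=3600; path=/; HttpOnly")
parsed = parse_set_cookie(sample)
print(parsed)
```

A cookie value that itself contains a comma followed by something that looks like name= would still be split incorrectly, so this remains a sketch rather than a full Set-Cookie parser.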

...

# assumption: like referer and creditUrl, loginFromUrl is prefixed with self.host in __init__
loginFromUrl = "/member.php?mod=logging&action=login&loginsubmit=yes&infloat=yes&lssubmit=yes&inajax=1"

def login(self):

    response = self._request("post", self.loginFromUrl, fields={
        "fastloginfield": "username",
        "username": self.username,
        "password": self.password,
        "quickforward": "yes",
        "handlekey": "ls"
    })

    if response.status == 400:
        print("Restricted by the server")
        return False

    # the identity cookie is present only after a successful login
    if self.identity_cookie_name in self.cookies:
        print("Login succeeded")
        return True
    return False

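User information

After a successful login we can fetch the current user's points, prestige, money and other information. We use the BeautifulSoup4 library to match HTML elements on the page; it retrieves DOM nodes much like JavaScript's jQuery, and even the API is very similar (see its official documentation for basic usage). We define a method that prints the current user's information, and pick an arbitrary forum page as the entry point.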


from bs4 import BeautifulSoup

referer = "/forum-45-1.html"

creditUrl = "/home.php?mod=spacecp&ac=credit&showcredit=1&inajax=1&ajaxtarget=extcreditmenu_menu"

def __init__(self, username, password, host, identity_cookie_name):
    ...
    self.referer = self.host + self.referer
    self.creditUrl = self.host + self.creditUrl
    ...

def info(self):
    response = self._request(
        "post",
        self.referer
    )
    bs = BeautifulSoup(response.data, "lxml")

    # points summary and nickname shown in the page header
    score = bs.find("a", id="extcreditmenu").string
    name = bs.find("strong", class_="vwmy").string

    # the credit menu is loaded from a separate URL
    menu_response = self._request("get", self.creditUrl)
    menu_response_bs = BeautifulSoup(menu_response.data, "lxml")

    hcredit_1 = menu_response_bs.find("span", id="hcredit_1").string
    hcredit_2 = menu_response_bs.find("span", id="hcredit_2").string

    print("Nickname: %s, %s\nPrestige: %s, Money: %s" % (name, score, hcredit_1, hcredit_2))

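[Image: python-discuz!]

The content shown in the image above was extracted with BeautifulSoup4, so we do not have to write regular expressions ourselves. We only used find to get the specified content; this function returns only the first matching element.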

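Points

In discuz! there are many ways to earn points: visiting other users' profile pages, posting threads, replying to threads, logging in daily, and so on. We implement only the simplest one, visiting other users' profile pages, which adds points automatically.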


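import random

def visitProfile(self, url):
    response = self._request("get", url)
    print(url + "\n")
    bs = BeautifulSoup(response.data, "lxml")
    # assumption: visit_loop is initialised to 0 on the class or in __init__
    self.visit_loop += 1

    avatars = bs.find_all("a", class_="avt")

    visit_len = len(avatars)
    print(visit_len)
    # randint(2, visit_len - 1) needs at least three links, otherwise it raises ValueError
    if visit_len > 2 and self.visit_loop < 20:
        index = random.randint(2, visit_len - 1)
        self.visitProfile(self.host + "/" + avatars[index]['href'])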

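We call this method from the info method, using the first user on the info page as the entry point. After that, one randomly chosen user from that user's recent visitors becomes the next entry point, with an upper limit of 20 visits.

def info(self):
    ...
    first = self.host + "/" + bs.find("a", class_="notabs")["href"]
    print("Nickname: %s, %s\nPrestige: %s, Money: %s" % (name, score, hcredit_1, hcredit_2))
    self.visitProfile(first)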
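
The recursive walk in visitProfile can also be written as a loop, which makes the 20-visit limit and the no-candidates case explicit. The sketch below is a simplified model, not the script's own code: fetch_links is a hypothetical callable standing in for the HTTP request plus the find_all lookup, and the fake link table exists only so that the example runs without a network.

```python
import random


def random_walk(start, fetch_links, limit=20, skip=2):
    """Visit up to `limit` pages, moving each time to a random link on the page.

    The first `skip` links are ignored, mirroring randint(2, ...) in visitProfile.
    """
    visited = []
    url = start
    while url is not None and len(visited) < limit:
        visited.append(url)
        candidates = fetch_links(url)[skip:]
        # stop when the page offers nothing beyond the skipped links
        url = random.choice(candidates) if candidates else None
    return visited


# Hypothetical link table used in place of real profile pages.
fake_site = {
    "u1": ["self", "owner", "u2", "u3"],
    "u2": ["self", "owner", "u1"],
    "u3": ["self", "owner"],  # no recent visitors: the walk ends here
}
path = random_walk("u1", lambda url: fake_site.get(url, []))
print(path)
```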

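At this point the basic flow works end to end. Now we need to write the collected cookies to a file so that the next run does not have to log in again. We do the write in the class's destructor, and we also need to read the cookies back in the constructor.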


import pathlib

def __init__(self, username, password, host, identity_cookie_name):
    ...
    # read the saved cookies first: opening the file with 'w' truncates it
    self.origin_cookies = self.readCookie()
    self.open_file = open('./cookie.txt', 'w')
    if self.origin_cookies:
        for value in self.origin_cookies.split(";"):
            pair = value.split("=", 1)
            if len(pair) == 2:  # skip empty or malformed fragments
                self.cookies[pair[0]] = pair[1]

def writeCookie(self, cookie):
    self.open_file.write(cookie)
    self.open_file.close()

def readCookie(self):
    if not pathlib.Path("./cookie.txt").exists():
        return None
    with open('./cookie.txt', 'r') as f:
        return f.read()

def __del__(self):
    if len(self.cookies) > 0:
        self.writeCookie(self.joinCookies())

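You may ask why the file is opened in the constructor instead of calling open inside writeCookie. The reason is that open cannot be relied on inside __del__: when __del__ runs during interpreter shutdown, built-ins such as open may already have been torn down. With that, the whole script is complete.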
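
An alternative to writing the file from __del__ is to save explicitly at the end of the run. The sketch below is one way to do that, assuming the cookies are a plain dict of strings. The names save_cookies and load_cookies and the temporary file path are hypothetical and chosen only for the example; the file format is the same name=value;name=value string that joinCookies produces.

```python
import os
import tempfile


def save_cookies(path, cookies):
    """Write cookies as "name=value;name=value"."""
    with open(path, "w") as f:
        f.write(";".join(name + "=" + value for name, value in cookies.items()))


def load_cookies(path):
    """Read cookies back; a missing or empty file yields an empty dict."""
    if not os.path.exists(path):
        return {}
    with open(path, "r") as f:
        text = f.read()
    cookies = {}
    for item in text.split(";"):
        name, sep, value = item.partition("=")
        if sep:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Round trip through a temporary file instead of ./cookie.txt.
cookie_path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
save_cookies(cookie_path, {"sid": "abc123", "auth": "xyz"})
restored = load_cookies(cookie_path)
print(restored)
```

In the script, save_cookies could be called in a try/finally block around the main flow, so the file is written while the interpreter is still fully alive.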
