Downloading Web Novels with Python

November 29, 2017 (last modified: October 07, 2021)

This post duplicates an earlier one that I had forgotten about when writing it. See 《制作网络小说的电子书》 (Making E-books from Web Novels).

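As a veteran reader of pirated web novels, I have never had to worry about finding something to read. A long time ago I used a program called 《小说下载浏览器》 (Novel Download Browser), which combined powerful search, download, and e-book building features with a rich set of site sources and templates. It was my first choice for reading novels back then, but unfortunately it was later shut down. More recently I have been using an app called 《追书神器》, but it is clearly moving toward paid content, so I am stuck on an old version. As the saying goes, relying on yourself beats relying on others, which here means writing my own program to download novels. Fortunately, pirated novel sites are everywhere these days, so writing such a program is not hard.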

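Task

Downloading a novel takes two steps:

Fetch the table of contents
Fetch the chapter content

The sections below use Yunlaige (云来阁) as the example and walk through the two steps in turn.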

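Fetching the table of contents

Novel sites generally have a table-of-contents page for each novel, for example the page for 《五行天》 on Yunlaige: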

//www.yunlaige.com/html/18/18535/index.html

First fetch the table-of-contents page with requests and parse it with BeautifulSoup, paying attention to the page's encoding.

response = requests.get(url, proxies=proxy_config)
bs_object = BeautifulSoup(response.content, "html.parser", from_encoding='gb18030')
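Why the encoding matters can be seen without touching the network. The sketch below is an illustration, not part of the original script: it assumes the site serves GB18030 bytes (as `from_encoding='gb18030'` suggests), simulates them by encoding a chapter title, and shows that decoding those bytes as UTF-8 loses the text. The variable names are made up for this example.

```python
# Illustration only: simulate the raw bytes a GB18030-encoded page would send.
title = "第一章 决定"
raw_bytes = title.encode("gb18030")

# Decoding with the matching codec recovers the text exactly.
decoded_ok = raw_bytes.decode("gb18030")

# Decoding as UTF-8 does not; errors="replace" substitutes U+FFFD
# for every byte sequence that is not valid UTF-8.
decoded_wrong = raw_bytes.decode("utf-8", errors="replace")

print(decoded_ok)
print("round trip ok:", decoded_ok == title)
print("UTF-8 decode damaged the text:", "\ufffd" in decoded_wrong)
```

Passing `response.content` (bytes) together with `from_encoding` lets BeautifulSoup do the decoding with the right codec instead of guessing.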

The table-of-contents page usually lists all chapters in a table (table) or a definition list (dl, dt, dd). For example:

<table border="0" align="center" cellpadding="3" cellspacing="1" id="contenttable" class="table">
  <tr>
    <td>
      <a href="8751561.html">第一章 决定</a>
    </td>
    <td>
      <a href="8751562.html">第二章 报道</a>
    </td>
    <td>
      <a href="8751563.html">第三章 怒火</a>
    </td>
  </tr>
</table>

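Each <a> tag here represents one chapter. Fetching the table of contents means finding this table element and collecting every <a> node under it.

The table above has a distinctive id, so select_one can locate it, and find_all (the current name for the older findAll alias) then returns all of its <a> nodes.

content_container_node = bs_object.select_one('#contenttable')
link_nodes = content_container_node.find_all('a')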
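The lookup can be tried offline against the sample markup. The following is a minimal sketch that parses a cut-down copy of the table from a string rather than from a live page; the `html`, `table`, and `chapters` names are only for this example.

```python
from bs4 import BeautifulSoup

# A cut-down copy of the sample table above, held in a string instead of fetched.
html = """
<table id="contenttable" class="table">
  <tr>
    <td><a href="8751561.html">第一章 决定</a></td>
    <td><a href="8751562.html">第二章 报道</a></td>
    <td><a href="8751563.html">第三章 怒火</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("#contenttable")  # CSS selector lookup by id
links = table.find_all("a")               # every <a> node under that table

# Pair each relative href with the chapter title inside the tag.
chapters = [(a["href"], str(a.string)) for a in links]
for href, name in chapters:
    print(href, name)
```

Note that `select_one` returns `None` when nothing matches, so a real script may want to check the result before calling `find_all` on it.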

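Each <a> node gives the chapter title and the address of the chapter page. The node's string attribute returns the text inside the tag, and urljoin resolves the relative href against the address of the table-of-contents page.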

href = a_link['href']
link = urljoin(url, href)
name = str(a_link.string)
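How `urljoin` resolves the different kinds of href may be easier to see in isolation. This is a small sketch using only the standard library; the base URL assumes the `https` scheme used by the chapter address in this post, and the root-relative and absolute hrefs are invented for illustration.

```python
from urllib.parse import urljoin

contents_url = "https://www.yunlaige.com/html/18/18535/index.html"

# A bare file name replaces the last path segment of the base URL.
chapter_url = urljoin(contents_url, "8751561.html")

# A root-relative href replaces the whole path.
root_url = urljoin(contents_url, "/html/18/18535/8751562.html")

# An absolute href is returned unchanged.
absolute_url = urljoin(
    contents_url, "https://www.yunlaige.com/html/18/18535/8751563.html"
)

print(chapter_url)
print(root_url)
print(absolute_url)
```

Because all three cases go through the same call, the download code does not need to know which style of link a given site uses.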

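Finally, combining the information for each chapter gives the table of contents for the whole novel.

The novel's title and the author's name can also be read from the table-of-contents page. The complete code is as follows: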

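from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_novel_contents_info(url, proxy_config, socket_config) -> dict:
    # socket_config is accepted but not used in this snippet.
    response = requests.get(url, proxies=proxy_config)
    # Pass raw bytes and name the encoding explicitly.
    bs_object = BeautifulSoup(response.content, "html.parser", from_encoding='gb18030')

    # The chapter links all live under the table with id "contenttable".
    content_container_node = bs_object.select_one('#contenttable')
    link_nodes = content_container_node.find_all('a')

    # Title and author come from the ".title" block. The slices are
    # site-specific: they drop the last four characters of the heading
    # and the first three characters of the author text.
    title_div = bs_object.select_one('.title')
    title_string = str(title_div.find('h1').string)[:-4]
    author_string = str(title_div.find('span').string)[3:]

    contents = []
    for a_link in link_nodes:
        # Skip anchors that do not point anywhere.
        if 'href' not in a_link.attrs:
            continue
        href = a_link['href']
        link = urljoin(url, href)  # resolve the relative href
        name = str(a_link.string)
        contents.append({
            'name': name,
            'link': link
        })
    return {
        'title': title_string,
        'author': author_string,
        'contents': contents,
        'content_url': url
    }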

Fetching chapter content

Chapter pages on novel sites have a fairly fixed layout, for example this chapter of 《五行天》 on Yunlaige:

https://www.yunlaige.com/html/18/18535/8754614.html

As with the table of contents, first fetch the page:

response = requests.get(url, proxies=proxy_config)
bs_object = BeautifulSoup(response.content, "html.parser", from_encoding='gb18030')

The chapter page also has a clear structure, for example:

<div id="contentbox" class="contentbox" style="border:1px solid #DDDDDD">
  <p class="ctitle">
    第八章 剑胎种子
  </p>

  <div id="content">
    <div class="kongwei">
    </div>
    <div class="ad250left">
    </div>
    <div class="kongwei2">
    </div>
    <div class="ad250right">
    </div>
    &nbsp;&nbsp;&nbsp;&nbsp;黑暗的房间，床上倚着墙角抱剑而坐的艾辉缓缓睁开眼睛。
    <br />
    <br />
    &nbsp;&nbsp;&nbsp;&nbsp;比黑夜还深邃的眼睛睁开的瞬间，漆黑的房间仿佛有一道寒芒闪过。这缕犀利冷冽的光芒一闪而逝，艾辉又恢复到无害的模样。<br />
    <br />
    &nbsp;&nbsp;&nbsp;&nbsp;离开蛮荒有些天，他还没有习惯躺在床上睡觉。<br />
    <br />
    &nbsp;&nbsp;&nbsp;&nbsp;检查了一下体内温养了三年的剑胎种子，没有任何变化。<br />
  </div>
</div>

We only need the chapter title and the body text. The other, irrelevant tags are filtered out.

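The title can be located with an id and a class. In the sample markup above it is the p.ctitle element inside #contentbox.

title_node = bs_object.select_one('#contentbox .ctitle')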

Use the id to locate the chapter body, then read the node's strings attribute to extract all of its text. strings is a generator, which is consumed later.

content_node = bs_object.select_one('#content')
paragraphs = content_node.strings
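What the generator yields is worth a quick look before using it. The sketch below runs on a shortened stand-in for the chapter markup (the English sentences are placeholders) and shows that whitespace-only strings between tags have to be dropped. It also uses `stripped_strings`, a BeautifulSoup attribute that strips each string and skips empty ones, as an alternative to the manual filter.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the chapter markup; the text is a placeholder.
html = """
<div id="content">
  <div class="ad250left"></div>
  &nbsp;&nbsp;First paragraph.<br />
  <br />
  &nbsp;&nbsp;Second paragraph.<br />
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content_node = soup.select_one("#content")

# .strings yields every text node, including the line breaks between tags,
# so strip each one (this also removes non-breaking spaces) and drop empties.
paragraphs = [s.strip() for s in content_node.strings if s.strip()]

# .stripped_strings performs the same strip-and-skip in one step.
paragraphs_alt = list(content_node.stripped_strings)

print(paragraphs)
```

Either form leaves one string per paragraph, which is the shape needed for rebuilding the page.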

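With the title and body located, the next step is to reassemble them into a new page that contains only the novel text. I use BeautifulSoup to build that page.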

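# The template spans several lines, so it needs a triple-quoted string.
html_chapter = BeautifulSoup("""
<html>
 <head>
 </head>
 <body>
  <div>
   <div>
    <p></p>
   </div>
  </div>
 </body>
</html>""", "html5lib")
main_div = html_chapter.div  # the first (outer) <div> of the template

# Chapter title as an <h1>.
html_title_node = html_chapter.new_tag('h1')
html_title_node.string = title_node.get_text(strip=True)
main_div.append(html_title_node)

# One <p> per non-empty string from the chapter body.
html_content_node = html_chapter.new_tag('div')
main_div.append(html_content_node)
for a_paragraph in paragraphs:
    p_text = a_paragraph.strip()
    if not p_text:
        continue  # skip whitespace between tags
    p_node = html_chapter.new_tag('p')
    p_node.string = p_text
    html_content_node.append(p_node)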

Finally, html_chapter.prettify() returns the formatted HTML source.

The complete code is as follows:

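import requests
from bs4 import BeautifulSoup


def get_novel_chapter(url, proxy_config, socket_config):
    # socket_config is accepted but not used in this snippet.
    response = requests.get(url, proxies=proxy_config)
    bs_object = BeautifulSoup(response.content, "html.parser", from_encoding='gb18030')

    # Selectors follow the sample markup above: the title is p.ctitle
    # inside #contentbox, and the body is the element with id "content".
    title_node = bs_object.select_one('#contentbox .ctitle')
    content_node = bs_object.select_one('#content')
    paragraphs = content_node.strings  # generator over all text nodes

    # The template spans several lines, so it needs a triple-quoted string.
    html_chapter = BeautifulSoup("""
<html>
 <head>
 </head>
 <body>
  <div>
   <div>
    <p></p>
   </div>
  </div>
 </body>
</html>""", "html5lib")
    main_div = html_chapter.div  # the first (outer) <div> of the template

    # Chapter title as an <h1>.
    html_title_node = html_chapter.new_tag('h1')
    html_title_node.string = title_node.get_text(strip=True)
    main_div.append(html_title_node)

    # One <p> per non-empty string from the chapter body.
    html_content_node = html_chapter.new_tag('div')
    main_div.append(html_content_node)
    for a_paragraph in paragraphs:
        p_text = a_paragraph.strip()
        if not p_text:
            continue  # skip whitespace between tags
        p_node = html_chapter.new_tag('p')
        p_node.string = p_text
        html_content_node.append(p_node)
    return html_chapter.prettify()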

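Downloading a complete novel

A complete download first fetches the novel's table of contents and then downloads every chapter. See one of my projects: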

perillaroc/novel-downloader
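The two-step flow described in this section can be sketched as a short loop. This is not the code from novel-downloader; `download_novel`, its parameters, and the stand-in fetchers are hypothetical names chosen for this example. The two fetch functions are passed in as callables, so the loop can be run here with fakes instead of live requests.

```python
import time
from typing import Callable, Dict, List


def download_novel(
    contents_url: str,
    get_contents: Callable[[str], dict],
    get_chapter: Callable[[str], str],
    delay_seconds: float = 0.0,
) -> Dict[str, object]:
    """Fetch the table of contents, then every chapter in listed order."""
    info = get_contents(contents_url)
    chapters: List[dict] = []
    for entry in info["contents"]:
        html = get_chapter(entry["link"])
        chapters.append({"name": entry["name"], "html": html})
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # optional pause between requests
    return {"title": info["title"], "author": info["author"], "chapters": chapters}


# Stand-in fetchers so the sketch runs without network access.
def fake_contents(url: str) -> dict:
    return {
        "title": "Sample Novel",
        "author": "Someone",
        "contents": [
            {"name": "Chapter 1", "link": url.replace("index.html", "1.html")},
            {"name": "Chapter 2", "link": url.replace("index.html", "2.html")},
        ],
    }


def fake_chapter(url: str) -> str:
    return "<html><body><h1>" + url + "</h1></body></html>"


novel = download_novel("https://example.invalid/index.html", fake_contents, fake_chapter)
print(novel["title"], len(novel["chapters"]))
```

With the functions from this post, the callables could be thin wrappers such as `lambda u: get_novel_contents_info(u, proxy_config, socket_config)` and `lambda u: get_novel_chapter(u, proxy_config, socket_config)`.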