Chinese Word Frequency Counting with Python

February 20, 2017 (last modified: October 07, 2021)

programming/algorithm, programming/python

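I have recently been reading the bestseller 《巨婴国》. Setting aside how well the book is written, one thing about it caught my interest: it talks about mothers far more than about fathers. That makes it a good opportunity to learn how to count Chinese word frequencies with Python.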

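Chinese Word Segmentation

The first step in counting word frequencies is word segmentation. There are many Python libraries for Chinese word segmentation; I chose jieba.
It is very simple to use: read the text file line by line and segment each line.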

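import os

import jieba

tokens = []
# Python 3: read the book line by line; the file is assumed to be UTF-8 encoded.
with open(os.path.join(os.path.dirname(__file__), "JuYingGuo.txt"), encoding="utf-8") as f:
    for line in f:
        clean_line = line.strip()  # drop the trailing newline
        if len(clean_line) > 0:  # skip empty lines
            # jieba.cut returns a generator over the tokens of this line
            seg_list = jieba.cut(clean_line)
            tokens.extend(seg_list)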

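Note that the trailing newline of each line must be stripped and empty lines must be skipped.
The segmentation result is shown in the figure below:

Word Frequency Counting

Segmentation yields a list of all the words in the text; counting word frequencies means counting how many times each word occurs. The Counter class from the collections module of the Python standard library makes this easy.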

from collections import Counter
counter = Counter(tokens)
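To see what Counter does in isolation, here is a minimal sketch on a tiny hand-written token list. The names sample_tokens and sample_counter are made up for illustration, and the tokens are not taken from the book.

```python
from collections import Counter

# Hypothetical token list standing in for the segmentation output.
sample_tokens = ["妈妈", "的", "孩子", "，", "妈妈", "的", "爱", "的"]

sample_counter = Counter(sample_tokens)

# Counter behaves like a dict that maps each token to its count.
print(sample_counter["的"])    # 3
print(sample_counter["妈妈"])  # 2
# A token that never occurred has a count of 0 instead of raising KeyError.
print(sample_counter["爸爸"])  # 0
```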

The most_common method directly returns the five most frequent words:

for word, count in counter.most_common(5):
    print(word + '\t' + str(count))

The results are shown in the table below:

Word	Count
， (comma)	21492
的 (particle)	11396
。 (full stop)	7104
是 (to be)	3682
我 (I)	2763

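The result is not what I hoped for: the most frequent token is a punctuation mark, the second is a particle, and the fifth is a personal pronoun. I am not interested in such tokens and only care about words that carry real meaning, so the segmentation output has to be post-processed by filtering it against a stop word list.

Stop Words

Stop words are words that carry “no” real meaning and can be filtered out when counting word frequencies. Several institutions maintain stop word lists. In the GitHub repository linked below I found several of them, including lists from four institutions: Harbin Institute of Technology, Baidu, the Chinese Academy of Sciences, and Sichuan University.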

https://github.com/LIEWyiyi/JD_spider/tree/master/JD_spider/stopword

After segmentation, only the words that are not in the stop word list are added to tokens.

import os

import jieba

tokens = []
stop_words = set()  # a set keeps the membership test below fast
stop_words_file_list = [
    "stop_words/hit_stop_words.txt",
    "stop_words/baidu_stop_words.txt",
    "stop_words/zh_cn_stop_words.txt",
]
# Load every stop word list (Python 3; the files are assumed to be UTF-8 encoded).
for file_path in stop_words_file_list:
    with open(os.path.join(os.path.dirname(__file__), file_path), encoding="utf-8") as f:
        for line in f:
            stop_words.add(line.strip())
# Segment the book and keep only the tokens that are not stop words.
with open(os.path.join(os.path.dirname(__file__), "JuYingGuo.txt"), encoding="utf-8") as f:
    for line in f:
        clean_line = line.strip()
        if len(clean_line) > 0:
            seg_list = jieba.cut(clean_line)
            for seg in seg_list:
                if seg not in stop_words:
                    tokens.append(seg)

With this filtering, the five most frequent words are:

Word	Count
会 (can, will)	1099
中 (in, middle)	885
说 (say)	827
孩子 (child)	701
妈妈 (mom)	664

Clearly, filtering out stop words yields more useful data. “妈妈” (mom) made it into the top five, which matches my impression while reading.
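The filtering step can also be tried on its own, without the book or the stop word files. The sketch below is only an illustration: count_words is a hypothetical helper, the token and stop word lists are invented, and dropping whitespace-only tokens is an extra assumption that the code above does not make.

```python
from collections import Counter


def count_words(tokens, stop_words):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    stop_set = set(stop_words)  # set lookup is O(1) on average
    return Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stop_set
    )


# Hypothetical data, for illustration only.
demo_tokens = ["妈妈", "的", "孩子", "，", "妈妈", "是", "我", "的", " "]
demo_stop_words = ["的", "，", "。", "是", "我"]

demo_counter = count_words(demo_tokens, demo_stop_words)
for word, count in demo_counter.most_common(2):
    print(word + "\t" + str(count))
```

Only “妈妈” and “孩子” survive the filter here, so the loop prints those two words with counts 2 and 1.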

Parents in 《巨婴国》

Here are the family-related words among the 100 most frequent words in 《巨婴国》.

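Rank	Word	Count
4	孩子 (child)	701
5	妈妈 (mom)	664
6	婴儿 (infant)	459
10	父母 (parents)	438
26	男人 (man)	232
27	家庭 (family)	217
31	巨婴 (giant infant)	205
36	父亲 (father)	186
46	母亲 (mother)	158
53	儿子 (son)	137
60	女儿 (daughter)	127
66	女人 (woman)	124
67	妻子 (wife)	122
90	男 (male)	101
100	丈夫 (husband)	90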
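A table like this one could be produced by walking through most_common and keeping only the words of interest together with their rank. The following is a rough sketch, not necessarily how the table above was made: ranked_rows is a hypothetical helper and toy_counter holds invented counts rather than the book's data.

```python
from collections import Counter


def ranked_rows(counter, words_of_interest, top_n=100):
    """Return (rank, word, count) for the words of interest within the top_n."""
    wanted = set(words_of_interest)
    rows = []
    # Enumerate from 1 so that the most frequent word has rank 1.
    for rank, (word, count) in enumerate(counter.most_common(top_n), start=1):
        if word in wanted:
            rows.append((rank, word, count))
    return rows


# Hypothetical counts, for illustration only (not the book's data).
toy_counter = Counter({"会": 9, "中": 8, "孩子": 7, "妈妈": 6, "说": 5, "父亲": 2})
toy_rows = ranked_rows(toy_counter, ["孩子", "妈妈", "父亲"], top_n=5)
for rank, word, count in toy_rows:
    print(str(rank) + "\t" + word + "\t" + str(count))
```

With top_n=5, “父亲” falls outside the cutoff in this toy data, so only “孩子” (rank 3) and “妈妈” (rank 4) are printed.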

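The figure below makes this even clearer: “妈妈” (mom) appears far more often than “男人” (man) and “父亲” (father).

It seems that 《巨婴国》 is better suited to female readers.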
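The mother versus father comparison can be put into numbers using the counts from the table above. This is a small sketch: the grouping of words into "mother" and "father" terms is my own assumption, and “爸爸” (dad) is left out because it does not appear in the table, so its count is unknown.

```python
# Counts copied from the table above; only words listed there are available.
family_counts = {"妈妈": 664, "母亲": 158, "父亲": 186}

# Assumed grouping of words into mother terms and father terms.
mother_terms = ["妈妈", "母亲"]
father_terms = ["父亲"]

mother_total = sum(family_counts[w] for w in mother_terms)
father_total = sum(family_counts[w] for w in father_terms)

print(mother_total)  # 822
print(father_total)  # 186
print(round(mother_total / father_total, 2))  # 4.42
```

Under this grouping, the mother terms occur roughly 4.4 times as often as the father term.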