NLP: Text clustering and similarity visualization in Python

学无止境 lv.1

Posted: 2022-04-20 05:43:49 559

Related tags: # node.js

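I am analyzing a 1167-page book (a txt file). So far I have finished preprocessing the data (cleaning, punctuation removal, stop-word removal, tokenization). How can I now cluster the text and visualize the similarity (plot the clusters)?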

For example:

text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"

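In my task I have split the whole book into chapters: text1 is the first chapter, text2 is the second chapter, and so on. Now I want to compare the first chapter with the second.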
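One possible way to compare two chapters is cosine similarity over word counts. The sketch below is only an illustration, not a definitive method: it assumes each chapter has already been reduced to a list of cleaned tokens, and the function name `cosine_similarity` and the toy lists `chapter1` and `chapter2` are hypothetical stand-ins, not part of the original code.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Dot product over the words that occur in both chapters.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty chapter is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical token lists standing in for two preprocessed chapters.
chapter1 = ["conviction", "people", "development", "understanding", "people"]
chapter2 = ["venture", "authorship", "people", "invocation"]

print(cosine_similarity(chapter1, chapter2))
```

A score near 1 means the two chapters use much the same vocabulary, and a score of 0 means they share no words. Raw counts favor frequent words, so a TF-IDF weighting (for instance scikit-learn's `TfidfVectorizer`) may give more meaningful scores on a full book.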

The sample code I used for data preprocessing:

# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(pages[1])

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]

# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
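Once every chapter has been turned into a `words` list like the one above, a clustering step could look roughly like the following. This is a minimal sketch under stated assumptions, not a definitive answer: it uses plain word counts, a simple average-linkage agglomerative procedure written by hand, and an arbitrary similarity threshold of 0.3. The names `similarity_matrix`, `cluster`, `plot_heatmap`, and the toy `chapters` data are all hypothetical. The plotting helper assumes matplotlib is installed and is not called automatically.

```python
import math
from collections import Counter


def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm = math.sqrt(sum(c * c for c in counts_a.values())) * math.sqrt(
        sum(c * c for c in counts_b.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(chapters):
    """Pairwise cosine similarity between all chapters (lists of tokens)."""
    vectors = [Counter(words) for words in chapters]
    return [[cosine(a, b) for b in vectors] for a in vectors]


def cluster(sim, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops when no pair
    has an average similarity of at least `threshold`.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                links = [sim[i][j] for i in clusters[x] for j in clusters[y]]
                avg = sum(links) / len(links)
                if avg >= best:
                    best, pair = avg, (x, y)
        if pair is None:
            break
        x, y = pair
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return [sorted(c) for c in clusters]


def plot_heatmap(sim, labels):
    """Draw the similarity matrix as a heatmap (requires matplotlib)."""
    import matplotlib.pyplot as plt  # third-party; only needed for plotting

    fig, ax = plt.subplots()
    image = ax.imshow(sim, vmin=0.0, vmax=1.0, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax, label="cosine similarity")
    fig.tight_layout()
    plt.show()


# Hypothetical token lists standing in for four preprocessed chapters.
chapters = [
    ["people", "conviction", "understanding", "people"],
    ["people", "understanding", "development"],
    ["venture", "authorship", "invocation"],
    ["authorship", "venture", "book"],
]
sim = similarity_matrix(chapters)
groups = cluster(sim, threshold=0.3)
print(groups)  # chapter indices grouped by cluster
```

Calling `plot_heatmap(sim, ["ch1", "ch2", "ch3", "ch4"])` would show the pairwise similarities as a colored grid. For a whole book, library routines such as scikit-learn's `KMeans` or SciPy's `scipy.cluster.hierarchy.linkage` and `dendrogram` are likely a better fit than this hand-written loop.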
