A Study of Blog Document Clustering Assisted by Word Category Concepts (以字詞類別概念輔助部落格文件分群之研究)
Date
2010
Abstract
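This thesis uses the ODP (Open Directory Project) directory structure as an external knowledge source. Through ODP's query function, we obtain the categories to which each word belongs and use them as features; the categories of all words in a post, together with their weights, are combined to construct the post's feature vector. The aim is to overcome the shortcomings of feature vectors built purely from extracted keywords and thereby achieve better topic-based clustering of posts. In addition, the topical concentration of posts differs from blog to blog, and a common problem when clustering with the K-Means algorithm is not knowing how to set an appropriate number of clusters K. This thesis therefore also proposes a method that automatically determines the number of clusters and the initial representative points for K-Means from the feature vectors of the posts in the collection, making blog post clustering more automated.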
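As an illustration of the category feature vector idea, the following is a minimal Python sketch, not the thesis's implementation. The abstract does not specify the weighting scheme or how ODP is queried, so the `WORD_CATEGORIES` lookup table, the category names, and the use of relative term frequency as the word weight are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the ODP query step: maps a word to the
# ODP-style categories it belongs to. Real data would come from ODP.
WORD_CATEGORIES = {
    "python": ["Computers/Programming", "Science/Biology"],
    "compiler": ["Computers/Programming"],
    "snake": ["Science/Biology"],
}


def category_feature_vector(words, word_categories):
    """Sum each word's weight into every category the word belongs to."""
    term_freq = Counter(words)
    total = sum(term_freq.values())
    vector = defaultdict(float)
    for word, count in term_freq.items():
        weight = count / total  # assumed weight: relative term frequency
        # Words with no known category contribute nothing to the vector.
        for category in word_categories.get(word, []):
            vector[category] += weight
    return dict(vector)


post = ["python", "compiler", "python", "compiler"]
vec = category_feature_vector(post, WORD_CATEGORIES)
```

In this toy post, the ambiguous word "python" contributes to both categories, while "compiler" reinforces only the programming category, so the programming category ends up with the larger weight. This is the kind of topical signal that a purely term-based vector would not expose.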
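We apply the category feature vector method and the term feature vector method separately in post clustering experiments and evaluate the clustering results using Accuracy and Purity. The evaluation shows that, for most blogs in the test set, the category feature vector method yields better clustering results than the term feature vector method. The experiments also show that additionally incorporating category feature vectors for a post's title words and compound words further improves clustering performance.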
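The abstract does not state how Purity is computed, so the sketch below assumes the commonly used definition: each cluster is credited with the count of its most frequent true label, and the credits are summed and divided by the number of posts. The function name `purity` and the sample labels are illustrative only.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of posts that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one post is required")
    # Group the true labels by assigned cluster.
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id].append(label)
    # Credit each cluster with the size of its largest label group.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)


clusters = [0, 0, 0, 1, 1, 1]
labels = ["travel", "travel", "food", "food", "food", "tech"]
score = purity(clusters, labels)  # (2 + 2) / 6
```

A purity of 1.0 means every cluster contains posts from a single topic. Because purity tends to rise as the number of clusters grows, it is usually read alongside a second measure, which is consistent with the thesis reporting Accuracy as well.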
Our approach uses the ODP (Open Directory Project) directory structure as external knowledge. Through the ODP query function, we obtain the categories of each query word and use those categories as word features. To build the category feature vector of a post, we merge the categories of all words in the post together with the corresponding word weights. We aim to overcome the drawbacks of building feature vectors from keyword frequency and to achieve better topic-based clustering results. We also propose a method to assist in choosing the value of K in the K-means algorithm; it takes the category relations among the posts of a blog into consideration, which makes clustering more automatic. We compare the clustering results of our approach with those of a term-based feature vector approach using the Purity and Accuracy measures. The experiments show that our approach performs better than the term-based feature vector approach. We also use the title and phrases of a post as additional feature vectors and show that these two features assist clustering effectively.
Keywords
Data Mining, Blog Post Clustering, Category Feature Vector