A News Event Summarization Method That Computes Sentence Centrality over a Sentence Relationship Network
Date
2014-07-01
Publisher
國立清華大學科技管理研究所
Abstract
The core of extractive summarization is to assess how representative each sentence is of the summary content, and to use this assessment to rank sentences for extraction. This study treats sentences as nodes and uses sentence similarity to decide whether a link exists between nodes, thereby constructing a sentence relationship network model. It then measures each node's importance in the network, or its influence on the nodes to which it connects, through five centrality analyses: (1) Degree Centrality, (2) Normalized Similarity-based Degree Centrality, (3) HITS Centrality, (4) PageRank Centrality, and (5) iSpreadRank Centrality. Sentence centrality is taken as the measure of a sentence's summary representativeness, and sentences are ranked accordingly. Finally, CSIS (Cross-Sentence Information Subsumption) is introduced to filter redundant information while sentences are extracted in ranked order to compose the summary. Experiments on the DUC 2004 dataset verify the feasibility of the proposed method. Under the ROUGE-1 metric, the summarization performance obtained with the different sentence centralities ranks as follows: iSpreadRank > Normalized Similarity-based Degree > PageRank > Degree > HITS. Overall, the experiments show that computing sentence centrality over a sentence relationship network is indeed a feasible approach to summarization.
Purpose: One widely adopted summarization paradigm, sentence extraction, aims at extracting important sentences and composing them into a summary. The foundation of sentence extraction is to assess the importance of sentences so as to rank them for extraction. This paper employs graph-based text analysis to model documents and investigates measures of graph-based centrality as sentence salience in summarization.

Design/methodology/approach: This paper models documents on the same (or related) topic as a sentence similarity network, in which a sentence is regarded as a node and a relationship between sentences exists only if they are semantically related. Several methods for evaluating the importance of a node (i.e., a sentence) in the network are then proposed, namely: (1) Degree Centrality; (2) Normalized Similarity-based Degree Centrality; (3) HITS Centrality; (4) PageRank Centrality; and (5) iSpreadRank Centrality. All are designed on the basis of the idea that the importance of a node is determined not only by the number of nodes to which it connects, but also by the importance of its connected nodes. As to summary generation, CSIS (Cross-Sentence Information Subsumption) is employed to avoid redundancy while sentences are extracted according to the ranking produced from sentence centrality.

Findings: The proposed summarization method was evaluated using the ROUGE evaluation suite on the DUC 2004 news stories collection. Experimental results show that, under the ROUGE-1 metric, the performance ranking is: iSpreadRank > Normalized Similarity-based Degree > PageRank > Degree > HITS. Another experiment, which combined sentence centrality with surface-level features, also produced competitive results compared with the best participant in the DUC 2004 evaluation.

Research limitations/implications: Directions for future research are: (1) instead of symbolic-level analysis, to take semantics into account, such as synonymy, polysemy, and term dependency, when determining whether two sentences are semantically related; (2) to investigate graph-based centrality measures developed in social network analysis for evaluating sentence salience in summarization; and (3) to improve the cohesion and coherence of summaries using natural language processing techniques, such as sentence planning and generation.

Practical implications: The proposed summarization method is unsupervised; thus no training dataset is required. Since no domain-specific knowledge or deep linguistic analysis is exploited, the method is domain- and language-independent. However, because no deep natural language analysis is performed, no discourse structure is considered, and no domain-specific knowledge is involved, the method may achieve only a shallow understanding of the input texts and consequently produce poorer summaries.

Originality/value: The contributions of this work are threefold. First, this paper offers a sentence similarity network to model topic-related documents. Second, novel graph-based sentence ranking methods are explored to rank the importance of sentences for extraction. Finally, the proposed method has been proven successful in a case study with the DUC 2004 benchmark dataset.
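To make the network construction and centrality ranking concrete, the following is a minimal sketch in Python. It assumes TF-IDF cosine similarity between sentences and an arbitrary linking threshold, and it uses scikit-learn and NetworkX purely for illustration; neither the threshold value nor these libraries is specified in the paper, and only two of the five centrality measures (Degree and PageRank) are shown.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def build_sentence_network(sentences, sim_threshold=0.1):
    # Nodes are sentence indices; an undirected, weighted edge is added
    # whenever the cosine similarity of two sentences exceeds the threshold.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > sim_threshold:
                graph.add_edge(i, j, weight=sim[i, j])
    return graph, sim

def rank_sentences(graph):
    # Two illustrative centrality scores per sentence node: plain degree
    # centrality and (edge-weighted) PageRank centrality.
    degree = nx.degree_centrality(graph)
    pagerank = nx.pagerank(graph, weight="weight")
    return degree, pagerank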
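Summary generation can then be sketched as greedy extraction in descending centrality order with a redundancy filter. The threshold-based similarity check below is only a rough stand-in for CSIS (the paper's actual subsumption criterion is not reproduced here), and the sentence budget and redundancy threshold are hypothetical parameters.

def extract_summary(sentences, scores, sim, max_sentences=5, redundancy_threshold=0.5):
    # Visit sentences from highest to lowest centrality score and skip any
    # candidate that is too similar to a sentence already selected.
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = []
    for idx in ranked:
        if all(sim[idx, chosen] < redundancy_threshold for chosen in selected):
            selected.append(idx)
        if len(selected) == max_sentences:
            break
    selected.sort()  # present extracted sentences in document order
    return [sentences[i] for i in selected]

For example, given sentences split from a cluster of topic-related news documents, one would call build_sentence_network, rank the nodes with PageRank (or any of the other centralities), and pass the resulting scores and similarity matrix to extract_summary.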