以MapReduce分散式架構有效率進行相似資料配對搜尋之研究
No Thumbnail Available
Date
2015
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
為了有效率計算相似資料配對搜尋問題,本論文提供一個有效篩除不相似配對的方法,並設計成對應的MapReduce平行處理架構。我們提出以最大值出現維度對資料預先分群,並設計改變分群數的方式及平衡各群間資料個數的分群調整方法。根據分群結果,將相似配對計算工作分成群內相似配對以及跨群相似配對;群內相似配對運用反向串列索引方式快速計算,在跨群配對方面,本研究先以計算群代表向量的方式對進行群配對的篩除,計算可能產生相似配對的群配對,再對群配對中資料進行候選配對的組合。此外本研究也運用MapReduce平行處理架構,以平行處理上述各步驟所需執行工作。實驗結果顯示此方法適合採用MapReduce平行處理架構,可比集中式處理有效減少相似資料配對問題的回覆時間。
For solving the All Pair Similarity Search (APSS) problem efficiently, this thesis aims to provides an effective approach to filter non-similar pairs. Besides, the MapReduce version of the approach is provided to solve the problem in parallel. At first, for each data point, the dimension with the maximum value is used to decide the corresponding group of data partition. An adjusting method is designed to further balance the number of elements in each data group. The similar pairs could be inter-group similar pairs or intra-group similar pairs. For finding the intra-group similar pairs, we apply the inverted list index to improve the efficiency of computing the similarity of data pairs in each group. In addition, for speeding up the computation of finding inter-group similar pairs, the leader-vector is used to represent each group for estimating the upper bound of similarity between each group pair. Only the pairs of groups, whose upper bounds of similarity are larger than the given similarity threshold, need to exactly compute similarity of the inter-group data pairs. Finally, we design the corresponding MapReduce algorithm to perform the proposed approach in parallel. The experimental results show that the proposed partition-based method can fit into the MapReduce programing scheme properly. Moreover, the proposed MapReduce algorithm can effectively reduce the response time of solving the APSS problem than the centralized approach.
For solving the All Pair Similarity Search (APSS) problem efficiently, this thesis aims to provides an effective approach to filter non-similar pairs. Besides, the MapReduce version of the approach is provided to solve the problem in parallel. At first, for each data point, the dimension with the maximum value is used to decide the corresponding group of data partition. An adjusting method is designed to further balance the number of elements in each data group. The similar pairs could be inter-group similar pairs or intra-group similar pairs. For finding the intra-group similar pairs, we apply the inverted list index to improve the efficiency of computing the similarity of data pairs in each group. In addition, for speeding up the computation of finding inter-group similar pairs, the leader-vector is used to represent each group for estimating the upper bound of similarity between each group pair. Only the pairs of groups, whose upper bounds of similarity are larger than the given similarity threshold, need to exactly compute similarity of the inter-group data pairs. Finally, we design the corresponding MapReduce algorithm to perform the proposed approach in parallel. The experimental results show that the proposed partition-based method can fit into the MapReduce programing scheme properly. Moreover, the proposed MapReduce algorithm can effectively reduce the response time of solving the APSS problem than the centralized approach.
Description
Keywords
相似資料配對問題, 分群篩除方法, MapReduce演算法, All Pair Similarity Search(APSS), partition based data filtering, MapReduce algorithm