Inverted Index with MapReduce

Author: 私念丶染流年初情
Source: 51数据库
2020-09-21

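The data structure used in the index is an inverted index, which maps each word to the documents it occurs in, together with the number of occurrences in each document. For example:

Word    occurrence@DocID ...
cat     6@Doc1    3@Doc2    4@Doc3    ...
Hot     9@Doc1    2@Doc3    10@Doc5   ...

Building an inverted index is a classic application of MapReduce.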
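
To make the structure above concrete, here is a minimal single-process sketch of how an inverted index could be built in the map/shuffle/reduce style. The function names (`map_phase`, `reduce_phase`, `build_inverted_index`) and the sample documents are hypothetical, and whitespace tokenization is an assumption made for brevity; a real job would run on a framework such as Hadoop rather than in one Python process.

```python
from collections import Counter, defaultdict


def map_phase(doc_id, text):
    """Emit (word, (doc_id, count)) for every distinct word in one document."""
    counts = Counter(text.lower().split())  # assumption: whitespace tokenization
    for word, count in counts.items():
        yield word, (doc_id, count)


def reduce_phase(word, postings):
    """Collect all postings for one word, sorted by document ID."""
    return word, sorted(postings)


def build_inverted_index(docs):
    # Shuffle step: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_phase(doc_id, text):
            grouped[word].append(posting)
    # Reduce step: one call per word.
    return dict(reduce_phase(word, postings) for word, postings in grouped.items())


# Hypothetical sample documents, for illustration only.
docs = {
    "Doc1": "cat cat hot",
    "Doc2": "cat dog",
    "Doc3": "hot hot cat",
}
index = build_inverted_index(docs)
for word in sorted(index):
    # Print each entry in the occurrence@DocID layout used above.
    print(word, " ".join(f"{count}@{doc_id}" for doc_id, count in index[word]))
```

Each mapper only ever sees one document and each reducer only one word, which is what lets the work be spread across machines.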

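…et expansion), there are several main approaches (taking document similarity as the example):
Brute force: the most direct approach. Two nested for loops compute the similarity between every pair of documents, giving a time complexity of O(n^2). Because no pair is skipped, this method yields the best results, but the amount of computation is so large that it is inefficient in practice, so it usually serves as a baseline.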
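
A minimal sketch of the brute-force baseline follows. It assumes cosine similarity over raw term-frequency vectors and whitespace tokenization; the source does not say which similarity measure is used, and the names `cosine` and `brute_force_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_similarity(docs):
    """Score every unordered pair of documents: n * (n - 1) / 2 comparisons."""
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Hypothetical sample documents, for illustration only.
sims = brute_force_similarity({"d1": "cat hot", "d2": "cat hot", "d3": "dog"})
```

Every pair is scored, including pairs such as `("d1", "d3")` that share no term and therefore always score zero. That wasted work is what the index-based approach avoids.
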
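Inverted index based: because most pairs of documents share no terms, performance can be improved by computing similarity only between documents that contain at least one term in common.
[Pseudocode: inverted-index-based similarity computation]
[Figure: MapReduce-based distributed computation framework]
To further reduce computation and save space, researchers have proposed a range of pruning strategies and approximation algorithms. For details, see "Scaling Up All Pairs Similarity Search", "Pairwise Document Similarity in Large Collections with MapReduce", and "Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce".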
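
The index-based idea can be sketched as follows. This is an illustrative single-process version, not the pseudocode from the cited papers: it assumes the similarity score is the unnormalized dot product of term-frequency vectors, and the names `build_postings`, `map_pairs`, and `pairwise_similarity` are hypothetical. The mapper walks one postings list and emits a partial score for each pair of documents in it; the reducer sums the partial scores per pair.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_postings(docs):
    """First pass: inverted index mapping term -> [(doc_id, term_frequency), ...]."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):  # sorted so each postings list is in doc_id order
        for term, tf in Counter(docs[doc_id].lower().split()).items():
            postings[term].append((doc_id, tf))
    return postings


def map_pairs(postings_list):
    """Mapper: for one term, emit ((doc_a, doc_b), partial score) per document pair."""
    for (a, weight_a), (b, weight_b) in combinations(postings_list, 2):
        yield (a, b), weight_a * weight_b


def pairwise_similarity(docs):
    """Reducer: sum the partial scores of each pair across all shared terms."""
    scores = defaultdict(int)
    for postings_list in build_postings(docs).values():
        for pair, partial in map_pairs(postings_list):
            scores[pair] += partial
    return dict(scores)


# Hypothetical sample documents, for illustration only.
sims = pairwise_similarity({"d1": "cat cat hot", "d2": "cat dog", "d3": "hot hot"})
```

Pairs with no common term, such as `("d2", "d3")` here, never appear in any postings list together, so no work is spent on them. The cost now depends on the length of the postings lists, which is why very frequent terms are a natural target for pruning.
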
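Locality sensitive hashing (LSH): documents are hashed into buckets by a function chosen with respect to some similarity measure, so that documents that are more similar under that measure are more likely to land in the same bucket. It is mainly used for near-duplicate detection, image similarity identification, and similar tasks. For details, see "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" and "Google News Personalization: Scalable Online Collaborative Filtering".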
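
As one illustration of the bucketing idea, the sketch below uses MinHash signatures with banding, a common LSH scheme for Jaccard similarity over token sets. The source does not specify a hash family, so this choice, the parameter values (`num_hashes`, `bands`, `seed`), and all function names are assumptions made for the example.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # large prime modulus for the hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    """MinHash signature of a non-empty token collection."""
    # crc32 gives a deterministic integer per token (built-in hash() is salted per run).
    values = [zlib.crc32(token.encode("utf-8")) for token in set(tokens)]
    return tuple(min((a * v + b) % PRIME for v in values) for a, b in params)


def lsh_candidates(docs, num_hashes=20, bands=5, seed=0):
    """Return the document pairs that share a bucket in at least one band."""
    params = make_hash_params(num_hashes, seed)
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash_signature(text.lower().split(), params)
        for band in range(bands):
            # Documents agreeing on every row of a band fall into the same bucket.
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


# Hypothetical sample documents, for illustration only; "a" and "b" are duplicates.
docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the mat",
    "c": "completely different words here",
}
candidates = lsh_candidates(docs)
```

Only the candidate pairs need an exact similarity check afterwards. Using more rows per band makes bucket collisions rarer and stricter, while using more bands makes it more likely that moderately similar documents collide at least once.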