Quantcast
Viewing all articles
Browse latest Browse all 11

Join processing with threshold-based filtering in MapReduce

Abstract

Data analytics, in particular those involving heterogeneous data, often require join operations on datasets collected from different sources. MapReduce, one of the most popular frameworks for large-scale data processing, is not suited for joining multiple datasets. This is because MapReduce often produces a large number of redundant intermediate results, irrespective of the size of the joined records. Although several existing approaches attempt to reduce the number of such redundant results using Bloom filters, they may be inefficient if large portions of records are joined or the number of distinct keys is large. To alleviate this problem, we propose a join processing method with threshold-based filtering in MapReduce, called TMFR-Join, which is an abbreviation for “Threshold-based Map-Filter-Reduce Join”. TMFR-Join applies filters according to their performance, which is estimated in terms of false-positive rates. It also provides a general framework for exploiting various filtering techniques that support certain desired operations. The experimental results indicate that the performance of TMFR-Join is close to that of the better of existing join processing techniques, both with and without filters.


Viewing all articles
Browse latest Browse all 11

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>