hadoop cat

Author: 你相公
Source: 51数据库
2020-09-22

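Why partitioning is needed: how do you use Hadoop to produce a globally sorted file? The simplest approach is to use a single partition, but this is extremely inefficient for large files, because one machine has to process all of the output, which throws away the parallelism that MapReduce provides.
In practice we can proceed as follows: first, create a series of sorted files; next, concatenate them (similar to a merge sort); the result is one globally ordered file. The key idea is to use a partitioner that describes the global ordering of the output. Suppose we have 1,000 values in the range 1-10000 and run 10 reduce tasks. If partitioning sends the values in 1-1000 to the first reduce, those in 1001-2000 to the second reduce, and so on, then every value assigned to the n-th reduce is greater than every value in the (n-1)-th reduce.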
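Each reduce then produces sorted output, so we only need to cat all of the output files, in partition order, into one large file, and that file is globally sorted as well.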
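The range-partition-then-concatenate idea can be sketched in plain Java without a cluster. This is only a minimal simulation under stated assumptions: the class `RangePartitionSortDemo` and its helpers are hypothetical names, the ranges are equal-width (1-1000, 1001-2000, ...), and `maxValue` is assumed to be at least `numReducers`. It is not how Hadoop itself is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RangePartitionSortDemo {
    // Hypothetical helper: maps a value in [1, maxValue] to one of
    // numReducers equal-width ranges (assumes maxValue >= numReducers).
    static int partitionFor(int value, int maxValue, int numReducers) {
        int width = maxValue / numReducers; // 1000 for maxValue=10000, numReducers=10
        return Math.min((value - 1) / width, numReducers - 1);
    }

    // Simulates the job: partition by range, sort each partition
    // independently, then concatenate the partitions in order ("cat").
    static int[] globalSort(int[] input, int maxValue, int numReducers) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int v : input) {
            buckets.get(partitionFor(v, maxValue, numReducers)).add(v);
        }
        int[] out = new int[input.length];
        int pos = 0;
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);   // each "reducer" sorts only its own range
            for (int v : bucket) {
                out[pos++] = v;         // append outputs in partition order
            }
        }
        return out;
    }

    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) {
            if (a[i - 1] > a[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = 1 + rnd.nextInt(10000); // 1,000 values in 1-10000
        }
        int[] sorted = globalSort(data, 10000, 10);
        System.out.println("globally sorted: " + isSorted(sorted));
    }
}
```

No bucket ever needs to see another bucket's data, which is what lets the reducers run in parallel.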
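That is the basic idea, but one question remains: how should the value ranges be divided when the data set is large and we do not know its distribution? A fairly simple answer is sampling. Given 100 million records, for instance, we can draw a sample of, say, 10,000 records and derive the range boundaries from that sample. In Hadoop, we can replace the default partitioner with TotalOrderPartitioner and pass it the sampling result to get the partitioning we want. For the sampling step, Hadoop's InputSampler provides several samplers: RandomSampler, SplitSampler, and IntervalSampler.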
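To make the sampling step concrete, here is a minimal plain-Java sketch of deriving split points from a sample and then routing keys by binary search over those split points. The class `SampledSplitPointsDemo` and both method names are hypothetical, and the quantile rule used to pick split points is an assumption for illustration; it is not taken from Hadoop's InputSampler or TotalOrderPartitioner.

```java
import java.util.Arrays;
import java.util.Random;

class SampledSplitPointsDemo {
    // Picks numPartitions - 1 split points at evenly spaced positions
    // of the sorted sample (an assumed quantile rule).
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.length / numPartitions);
            splits[i - 1] = sorted[idx];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; keys in
    // [splits[i-1], splits[i]) go to partition i; the rest go to the last one.
    static int partitionFor(int key, int[] splits) {
        int idx = Arrays.binarySearch(splits, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(100_000);
        }
        // Draw a 10,000-record sample instead of sorting the full data set.
        int[] sample = new int[10_000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = data[rnd.nextInt(data.length)];
        }
        int[] splits = splitPoints(sample, 10);
        int[] counts = new int[10];
        for (int v : data) {
            counts[partitionFor(v, splits)]++;
        }
        System.out.println("split points: " + Arrays.toString(splits));
        System.out.println("records per partition: " + Arrays.toString(counts));
    }
}
```

If the sample is representative, the per-partition counts should come out roughly balanced, which is the reason for sampling rather than guessing fixed ranges.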
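In this way we can sort large volumes of data on a distributed file system. We can also define the comparison rule ourselves, for example through the key's compareTo method or a custom comparator, which makes it possible to sort strings or other non-numeric types, and to implement secondary or even multi-level sorting. Reposted; for reference only.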
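As an illustration of a custom comparison rule, the sketch below orders a composite key by a string field first and an integer field second, which is the essence of a secondary sort. It uses only `java.util.Comparator`; the `CompositeKey` class and `sortAndFormat` helper are hypothetical. In a real Hadoop job the same ordering would live in the key's `compareTo` or in a comparator registered on the job (for example with `Job.setSortComparatorClass`), which this sketch does not attempt to show.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SecondarySortDemo {
    // Hypothetical composite key: a string primary field plus an int secondary field.
    static final class CompositeKey {
        final String primary;
        final int secondary;

        CompositeKey(String primary, int secondary) {
            this.primary = primary;
            this.secondary = secondary;
        }
    }

    // Compare by the string field first, then break ties with the int field.
    static final Comparator<CompositeKey> ORDER =
            Comparator.comparing((CompositeKey k) -> k.primary)
                      .thenComparingInt(k -> k.secondary);

    // Builds keys from two parallel arrays, sorts them, and formats each as "primary:secondary".
    static String[] sortAndFormat(String[] primaries, int[] secondaries) {
        List<CompositeKey> keys = new ArrayList<>();
        for (int i = 0; i < primaries.length; i++) {
            keys.add(new CompositeKey(primaries[i], secondaries[i]));
        }
        keys.sort(ORDER);
        String[] out = new String[keys.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = keys.get(i).primary + ":" + keys.get(i).secondary;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] result = sortAndFormat(
                new String[]{"pear", "apple", "pear", "apple"},
                new int[]{2, 9, 1, 3});
        System.out.println(String.join(",", result));
    }
}
```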

To use a partitioner, you first need to know what it does.

Partitioner partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
HashPartitioner is the default Partitioner.
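Roughly speaking: the partitioner controls how the keys of the map output are partitioned. Each key is assigned to a partition so that its record can be sent to the right reduce node, and the total number of partitions equals the number of reduce tasks. The default partitioner is HashPartitioner.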
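The hash-based rule can be written in a few lines. The sketch below is a standalone illustration, not Hadoop's class: `HashPartitionDemo` is a hypothetical name, and the formula (clear the sign bit of `hashCode()`, then take the remainder by the number of reduce tasks) is, to my understanding, the arithmetic HashPartitioner uses.

```java
class HashPartitionDemo {
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result of
    // the modulo is always in [0, numReduceTasks), even for negative hash codes.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String key : keys) {
            // Equal keys always land in the same partition.
            System.out.println(key + " -> partition " + getPartition(key, 4));
        }
    }
}
```

A hash spreads keys evenly but does not preserve their order across partitions, which is why a globally sorted output needs a range-based partitioner instead.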
2. How is it used?
......
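Configuration conf = getConf();

// Create the job (Job.getInstance replaces the deprecated new Job(conf, name))
Job job = Job.getInstance(conf, "hello");
......
// Set the partitioner explicitly; HashPartitioner is already the default
job.setPartitionerClass(HashPartitioner.class);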