public class NLineInputFormatFixed
NLineInputFormat which splits N lines of input as one split.
In many "pleasantly" parallel applications, each process/mapper
processes the same input file(s), but the computations are
controlled by different parameters (referred to as "parameter sweeps").
One way to achieve this is to specify a set of parameters
(one set per line) as input in a control file
(which is the input path to the map-reduce application,
whereas the input dataset is specified
via a config variable in JobConf).
NLineInputFormat can be used in such applications: it splits
the input file so that, by default, one line is fed as
a value to one map task, and the key is the line's offset,
i.e. (k,v) is (LongWritable, Text).
The location hints will span the whole mapred cluster.
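The splitting described above can be sketched without any Hadoop dependency. The following is a minimal illustration, not this class's implementation: the class name NLineSplitSketch, the Range type, and the splitByLines helper are hypothetical, and the control file is assumed to be an in-memory byte array with '\n' line terminators. It groups every N lines into one (start, length) byte range.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: groups every N lines of a control file into one byte range.
class NLineSplitSketch {

    /** A byte range of the control file: [start, start + length). */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + length + ")";
        }
    }

    /** Returns one range per group of numLines lines; the last group may be shorter. */
    static List<Range> splitByLines(byte[] data, int numLines) {
        if (numLines < 1) {
            throw new IllegalArgumentException("numLines must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        long begin = 0;   // first byte of the range being built
        long length = 0;  // bytes accumulated so far in that range
        int lines = 0;    // complete lines accumulated so far
        for (byte b : data) {
            length++;
            if (b == '\n') {
                lines++;
                if (lines == numLines) {
                    ranges.add(new Range(begin, length));
                    begin += length;
                    length = 0;
                    lines = 0;
                }
            }
        }
        if (length > 0) {
            // Trailing lines that did not fill a whole group of numLines.
            ranges.add(new Range(begin, length));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Five parameter sets, one per line, 4 bytes each.
        byte[] control = "a=1\na=2\na=3\na=4\na=5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(splitByLines(control, 2)); // prints [(0, 8), (8, 8), (16, 4)]
    }
}
```

With N = 2, the five-line control file yields three ranges, so three map tasks would each receive at most two parameter sets.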
protected static org.apache.hadoop.mapreduce.lib.input.FileSplit createFileSplit(org.apache.hadoop.fs.Path fileName,
NLineInputFormat uses LineRecordReader, which always reads
(and consumes) at least one character past its upper split
boundary. So, to make sure that each mapper gets exactly N lines, we
move the upper limit of each split back
by one character here.
fileName - Path of file
begin - the position of the first byte in the file to process
length - number of bytes in InputSplit
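The boundary arithmetic can be illustrated with plain numbers. The sketch below shows one way such an adjustment could work; it is an assumption modelled on how stock Hadoop's NLineInputFormat is understood to behave, not taken from this class's source. The class name SplitBoundarySketch and the adjust helper are hypothetical, and a long[] {start, length} pair stands in for a FileSplit.

```java
// Hypothetical sketch of moving each split's upper limit back by one byte.
class SplitBoundarySketch {

    /** Returns {start, length} of the adjusted split for a raw range. */
    static long[] adjust(long begin, long length) {
        if (begin == 0) {
            // First split: nothing precedes it, so only the end moves back.
            return new long[] {0, length - 1};
        }
        // Later splits: start one byte earlier (on the previous line's
        // terminator) and keep the length, so the end moves back by one too.
        return new long[] {begin - 1, length};
    }

    public static void main(String[] args) {
        // Raw two-line ranges of an assumed control file with 4-byte lines.
        long[][] raw = {{0, 8}, {8, 8}, {16, 4}};
        for (long[] r : raw) {
            long[] a = adjust(r[0], r[1]);
            System.out.println("raw [" + r[0] + ", " + (r[0] + r[1]) + ") -> adjusted ["
                    + a[0] + ", " + (a[0] + a[1]) + ")");
        }
    }
}
```

Under this assumption, a reader that starts mid-file discards everything up to the first line terminator, so starting one byte early on the previous terminator loses no data, while the earlier end keeps the reader from consuming the first line of the next split.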
public static void setNumLinesPerSplit(org.apache.hadoop.mapreduce.Job job,
Set the number of lines per split
job - the job to modify
numLines - the number of lines per split
public static int getNumLinesPerSplit(org.apache.hadoop.mapreduce.JobContext job)
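The setter/getter pair can be pictured as a thin wrapper over a key-value job configuration. The sketch below is a stand-in only: a Map<String, String> replaces the Hadoop job configuration, the class name LinesPerSplitConfigSketch is hypothetical, and the key name is an assumption borrowed from stock Hadoop's NLineInputFormat. The fallback of 1 reflects the default of one line per map task stated in the class description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the Hadoop job configuration.
class LinesPerSplitConfigSketch {

    // Assumed key name; this class may use a different one.
    static final String LINES_PER_MAP = "mapreduce.input.lineinputformat.linespermap";

    /** Stores the number of lines per split in the configuration. */
    static void setNumLinesPerSplit(Map<String, String> conf, int numLines) {
        conf.put(LINES_PER_MAP, Integer.toString(numLines));
    }

    /** Reads the number of lines per split, falling back to 1 when unset. */
    static int getNumLinesPerSplit(Map<String, String> conf) {
        String value = conf.get(LINES_PER_MAP);
        return value == null ? 1 : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getNumLinesPerSplit(conf)); // prints 1
        setNumLinesPerSplit(conf, 5);
        System.out.println(getNumLinesPerSplit(conf)); // prints 5
    }
}
```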