When you launch a job, Hadoop copies the files specified by the -files, -archives,
and -libjars options to the distributed filesystem (normally HDFS). Then, before a
task is run, the tasktracker copies the files from the distributed filesystem to a local
disk (the cache) so the task can access them. Files specified by -libjars are also added
to the task's classpath before it is launched.
Files are localized under the ${mapred.local.dir}/taskTracker/archive directory on
the tasktrackers. Applications don’t have to know this, however, because the files are
symbolically linked from the task’s working directory.
Path[] localPaths = context.getLocalCacheFiles();
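For example, if a lookup file were shipped with a command along the lines of hadoop jar job.jar MyDriver -files stations.txt input output (the jar, driver, and file names are placeholders, and the driver is assumed to use ToolRunner so the generic options are parsed), a Mapper could open the file by its plain name in its setup() method, relying on the symlink described above. The following is only a minimal sketch using the new API; the tab-separated lookup format is also an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedFileMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // The cached file is symlinked into the task's working directory,
    // so it can be opened by its plain name.
    BufferedReader in = new BufferedReader(new FileReader("stations.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("\t", 2);   // assumed tab-separated id/name
        if (fields.length == 2) {
          lookup.put(fields[0], fields[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Join each input record (assumed "id<TAB>rest") against the cached table.
    String[] fields = value.toString().split("\t", 2);
    String name = lookup.get(fields[0]);
    if (name != null) {
      context.write(new Text(name), new Text(fields.length > 1 ? fields[1] : ""));
    }
  }
}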
Chaining MapReduce jobs in a sequence
Recall that a driver sets up a JobConf object with the configuration parameters for a
MapReduce job and passes it to JobClient.runJob() to start the job. Because
JobClient.runJob() blocks until the job finishes, chaining MapReduce jobs is a matter
of calling the driver of one job after another. The driver for each job has to create a
new JobConf object and set its input path to the output path of the previous job. The
intermediate data generated at each step of the chain can be deleted at the end.
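A minimal sketch of this pattern with the old org.apache.hadoop.mapred API follows; the job names, paths, and the elided mapper/reducer settings are placeholders rather than code from the original text:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequenceDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);
    Path output = new Path(args[2]);

    JobConf job1 = new JobConf(SequenceDriver.class);
    job1.setJobName("step1");
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    // ... set mapper, reducer, and key/value classes for step 1 ...
    JobClient.runJob(job1);          // blocks until step 1 completes

    JobConf job2 = new JobConf(SequenceDriver.class);
    job2.setJobName("step2");
    // The second job's input path is the first job's output path.
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    // ... set mapper, reducer, and key/value classes for step 2 ...
    JobClient.runJob(job2);          // blocks until step 2 completes

    // Clean up the intermediate data once the chain has finished.
    FileSystem.get(job2).delete(intermediate, true);
  }
}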
Chaining MapReduce jobs with complex dependency
In addition to holding job configuration information, Job also holds dependency
information, specified through the addDependingJob() method. Whereas Job objects
store the configuration and dependency information, JobControl objects manage and
monitor the execution of the jobs. You add jobs to a JobControl object via the
addJob() method. After adding all the jobs and dependencies, run the JobControl
object in a thread of its own (it implements Runnable) to submit and monitor the jobs
for execution. JobControl has methods such as allFinished() and getFailedJobs() to
track the execution of the various jobs within the batch.
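A minimal sketch of this pattern, assuming job1Conf and job2Conf are already fully configured JobConf objects for the two steps:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependentJobsDriver {
  public static void runChain(JobConf job1Conf, JobConf job2Conf)
      throws Exception {
    Job job1 = new Job(job1Conf);
    Job job2 = new Job(job2Conf);
    job2.addDependingJob(job1);        // job2 runs only after job1 succeeds

    JobControl control = new JobControl("dependent-jobs");
    control.addJob(job1);
    control.addJob(job2);

    // JobControl implements Runnable, so run it in its own thread
    // and poll until every job has finished (or failed).
    Thread controller = new Thread(control);
    controller.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    if (!control.getFailedJobs().isEmpty()) {
      System.err.println("Failed jobs: " + control.getFailedJobs());
    }
    control.stop();
  }
}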
Hadoop introduced the ChainMapper and ChainReducer classes in version 0.19.0 to
simplify the composition of pre- and postprocessing steps. You call the addMapper()
method in ChainMapper and ChainReducer to compose the pre- and postprocessing
steps, respectively. Running all the pre- and postprocessing steps in a single job
leaves no intermediate files and dramatically reduces I/O.
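The sketch below shows one way to compose such a chain with the old org.apache.hadoop.mapred API; PassMap and PassReduce are placeholder pass-through steps standing in for real pre-, core, and postprocessing classes. The last boolean argument to each call is the byValue flag, discussed below.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {

  // Placeholder map step: passes each record through unchanged.
  public static class PassMap extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      output.collect(key, value);
    }
  }

  // Placeholder reduce step: emits every value for its key.
  public static class PassReduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainDriver.class);
    job.setJobName("chain-example");
    job.setInputFormat(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Preprocessing steps, composed with ChainMapper.addMapper().
    ChainMapper.addMapper(job, PassMap.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(job, PassMap.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));

    // The single reducer of the chained job.
    ChainReducer.setReducer(job, PassReduce.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));

    // Postprocessing step, composed with ChainReducer.addMapper().
    ChainReducer.addMapper(job, PassMap.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));

    JobClient.runJob(job);
  }
}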
In the standard Mapper model, the output key/value pairs are serialized and written
to disk, ready to be shuffled to a reducer that may be on a completely different node.
Formally, this is considered passing by value, as a copy of the key/value pair is sent
over. In the current case, where we can chain one Mapper to another, the two can
execute in the same JVM thread. Therefore, it's also possible for the key/value pairs
to be passed by reference, where the output of the initial Mapper stays in place in
memory and the following Mapper refers to it directly in the same memory location.
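In the old API this choice is exposed through the boolean byValue argument of ChainMapper.addMapper() and of ChainReducer's setReducer() and addMapper() (as used in the sketch above): passing true forces pass-by-value semantics, the safe choice when a downstream step may hold on to or modify the objects it receives, while false allows pass-by-reference and avoids the copying overhead.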