How to auto re-launch a YARN Application Master on a failure.
1) Use Case:
The fundamental idea of Hadoop 2 (MapReduce + YARN) is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs.
The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager has two main components:
1. Scheduler: responsible for allocating resources to the various running applications.
2. ApplicationsManager: responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
In a cluster we cannot predict when a node will go down, so the cluster has to be set up carefully to prevent any interruption or data loss caused by a node or daemon failure. One way to do that is to have the ApplicationMaster restart automatically on failure.
2) Solutions:
To enable Application Master restarts:
1. You can adjust the yarn.resourcemanager.am.max-attempts property in the yarn-site.xml file (early Hadoop 2 alpha releases named it yarn.resourcemanager.am.max-retries). The default setting is 2.
2. You can more directly tune how many times a MapReduce Application Master should restart by adjusting the mapreduce.am.max-attempts property in the mapred-site.xml file. The default setting is 2. This per-job value cannot exceed the ResourceManager-wide limit from step 1.
3. You can set the yarn.app.mapreduce.am.job.recovery.enable property to enable recovery of completed tasks. The default setting is "true". It is a MapReduce property defined in mapred-default.xml, so mapred-site.xml is the more usual place for it, although some guides list it under yarn-site.xml.
A setting of 2 means the Application Master gets one automatic relaunch after its first attempt fails, so if you want more relaunch attempts you need to raise the value accordingly.
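As a rough illustration of the settings above, the following Python sketch embeds two example site-file fragments and works out how many automatic relaunches they would allow. The values 4 and 3 are made-up examples, not recommendations; the helper names read_properties and relaunches_allowed are hypothetical; and the property name yarn.resourcemanager.am.max-attempts is the one assumed for later Hadoop 2.x releases, so adjust it for your version. The sketch also assumes the per-job value is capped by the cluster-wide limit.

```python
import xml.etree.ElementTree as ET

# Example yarn-site.xml fragment (the value 4 is illustrative only).
YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>4</value>
  </property>
</configuration>"""

# Example mapred-site.xml fragment (the value 3 is illustrative only).
MAPRED_SITE = """<configuration>
  <property>
    <name>mapreduce.am.max-attempts</name>
    <value>3</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.job.recovery.enable</name>
    <value>true</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Parse a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        props[prop.findtext("name").strip()] = prop.findtext("value").strip()
    return props


def relaunches_allowed(yarn_props, mapred_props, default=2):
    """Return how many automatic AM relaunches the settings permit.

    Assumption: the per-job limit is capped by the cluster-wide limit,
    and N attempts means N - 1 relaunches after the first failure.
    """
    cluster_limit = int(yarn_props.get("yarn.resourcemanager.am.max-attempts", default))
    job_limit = int(mapred_props.get("mapreduce.am.max-attempts", default))
    effective_attempts = min(cluster_limit, job_limit)
    return max(effective_attempts - 1, 0)
```

Calling relaunches_allowed(read_properties(YARN_SITE), read_properties(MAPRED_SITE)) on these fragments gives 2, and with both properties left at the default of 2 it gives 1, which matches the "2 means one relaunch" rule described above.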
3) Working
When a node goes down, the containers running on it, including any Application Masters, are terminated as well. When that happens, the ResourceManager does two things:
(1) For Application Masters, it automatically restarts them based on application policies.
(2) For containers, it informs the corresponding ApplicationMasters, which can then take the necessary action (retry, kill the application, etc.).
When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only when ApplicationMasters persist their state in an external location so that it can be used later. An ApplicationMaster that stores its state on persistent storage can therefore recover all the status recorded up to the failure.
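The restart-and-recover behaviour described above can be sketched with a small toy model in plain Python. This is not the YARN API: ToyAppMaster, run_with_restarts and the crash_plan argument are hypothetical names invented for illustration, and a JSON file stands in for the external location where a real ApplicationMaster would persist its state.

```python
import json
import os


class ToyAppMaster:
    """Hypothetical AM that checkpoints each finished task to a JSON file."""

    def __init__(self, tasks, state_path):
        self.tasks = tasks
        self.state_path = state_path
        # A new attempt starts by recovering whatever the old attempt persisted.
        self.completed = self._recover()

    def _recover(self):
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                return json.load(f)
        return []

    def _persist(self):
        with open(self.state_path, "w") as f:
            json.dump(self.completed, f)

    def run(self, fail_after=None):
        """Run unfinished tasks; crash after `fail_after` newly run tasks."""
        ran = 0
        for task in self.tasks:
            if task in self.completed:
                continue  # recovered from the previous attempt, do not re-run
            if fail_after is not None and ran == fail_after:
                raise RuntimeError("simulated ApplicationMaster crash")
            self.completed.append(task)
            self._persist()
            ran += 1
        return ran


def run_with_restarts(tasks, state_path, max_attempts=2, crash_plan=None):
    """Toy ResourceManager loop: start a fresh AM per attempt, up to the limit.

    crash_plan maps an attempt number to the number of tasks that attempt
    runs before crashing. Returns (attempt_number, tasks_run_in_that_attempt)
    on success, or None when every attempt failed.
    """
    crash_plan = crash_plan or {}
    for attempt in range(1, max_attempts + 1):
        am = ToyAppMaster(tasks, state_path)
        try:
            return attempt, am.run(crash_plan.get(attempt))
        except RuntimeError:
            continue  # the "ResourceManager" launches another attempt
    return None
```

With four tasks, max_attempts=2 and crash_plan={1: 2}, the first attempt finishes two tasks and crashes, and the second attempt reads the checkpoint and runs only the remaining two. If the second attempt also crashes, the function returns None, which corresponds to the application failing once its attempts are used up.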
4) Description
In Hadoop 1 the death of the JobTracker would result in the loss of all jobs, both running and queued. With a YARN MapReduce job, the equivalent of the JobTracker is the Application Master. Because the Application Master now runs on compute nodes, this can lead to an increase in failure scenarios. To combat MapReduce Application Master failures, YARN has the capability to restart the Application Master a specified number of times, as well as the capability to recover completed tasks. Additionally, much like the JobTracker, the Application Master keeps metrics for jobs that are currently running. Typically the Application Master tracking URL makes these available, and these metrics can be found in the YARN web UI.
During the execution of an application, the ApplicationMaster communicates with NodeManagers through an NMClientAsync object. All container events are handled by the NMClientAsync.CallbackHandler associated with that NMClientAsync. A typical callback handler handles client start, stop, status update and error. The ApplicationMaster also reports execution progress to the ResourceManager by implementing the getProgress() method of AMRMClientAsync.CallbackHandler.
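To make the callback pattern more concrete, here is a plain-Python stand-in. It is not Hadoop code: the real handlers are Java interfaces, and the class ToyCallbackHandler, its snake_case method names, the dispatch function and the event tuples are all hypothetical. The sketch only shows the idea that container events arrive through callbacks and that progress is derived from them as a fraction between 0 and 1.

```python
class ToyCallbackHandler:
    """Hypothetical handler: collects container events and derives progress."""

    def __init__(self, total_containers):
        self.total = total_containers
        self.started = set()
        self.stopped = set()
        self.statuses = {}
        self.errors = []

    def on_container_started(self, container_id):
        self.started.add(container_id)

    def on_container_status_received(self, container_id, status):
        self.statuses[container_id] = status

    def on_container_stopped(self, container_id):
        self.stopped.add(container_id)

    def on_error(self, container_id, message):
        self.errors.append((container_id, message))

    def get_progress(self):
        """Fraction of containers that have finished, between 0.0 and 1.0."""
        if self.total == 0:
            return 0.0
        return len(self.stopped) / self.total


def dispatch(handler, events):
    """Toy event loop: route (kind, container_id, payload) tuples to callbacks."""
    for kind, container_id, payload in events:
        if kind == "started":
            handler.on_container_started(container_id)
        elif kind == "status":
            handler.on_container_status_received(container_id, payload)
        elif kind == "stopped":
            handler.on_container_stopped(container_id)
        elif kind == "error":
            handler.on_error(container_id, payload)
        else:
            raise ValueError("unknown event kind: " + kind)
```

In this toy model, a handler created for four containers that has seen two "stopped" events reports a progress of 0.5. A real ApplicationMaster would return a comparable value from getProgress() so that the ResourceManager can display it.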