What is it?
Mesos is an independent cluster platform which hosts cluster computing frameworks. Given that a cluster platform's main responsibility is to manage cluster resources, a good example of a similar architecture in Hadoop would be YARN and Tez, where YARN is the native Hadoop cluster platform and Tez is a domain-specific (the domain being DAGs) cluster computing framework hosted by YARN. This allows not only multi-framework hosting but also multiple instances and/or versions of the same framework hosted in the same shared cluster.
Architecture
Mesos - Consists of Master and Slave nodes, with allocation modules hosted by the Master
Framework - Consists of a Scheduler, which registers with the Master to receive resource offers, and Executor processes launched on Slave nodes (to execute the framework's tasks)
Scheduling and data locality
Mesos introduces the concept of a decentralized scheduler. The arguments for that are:
- the need for an expressive API to support all frameworks (known and unknown)
- new frameworks may already come with their own scheduling mechanisms, so refactoring them to fit a centralized scheduler would not be feasible
So Mesos delegates scheduling to the frameworks themselves via resource offers. Mesos decides how many resources to offer each framework based on organizational policies (e.g., fair sharing, priority), while the framework itself decides whether to accept or reject an offer and how much of the offered resources to use.
Slave reports resource availability to the Master ->
Master (offers 4 CPU and 8 GB on this slave) via allocation policy module ->
Framework's scheduler (takes 2 CPU and 2 GB memory on this slave, or rejects) ->
Master (OK, and launches the tasks on that slave)
Unused resources are offered to other frameworks and the overall process repeats upon completion of tasks.
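To make the offer cycle concrete, here is a minimal sketch in plain Python. It is not the Mesos API: the Offer and Framework classes, the on_offer method and the offer_round function are hypothetical names invented for illustration, and the fixed ordering of frameworks stands in for a real allocation policy.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    slave: str
    cpus: float
    mem_gb: float


@dataclass
class Framework:
    """Hypothetical framework scheduler that wants a fixed slice per task."""
    name: str
    want_cpus: float
    want_mem_gb: float
    launched: list = field(default_factory=list)

    def on_offer(self, offer):
        """Return the (cpus, mem_gb) taken from the offer, or None to reject it."""
        if offer.cpus >= self.want_cpus and offer.mem_gb >= self.want_mem_gb:
            self.launched.append(offer.slave)
            return (self.want_cpus, self.want_mem_gb)
        return None


def offer_round(free, frameworks):
    """One round: offer each slave's free resources to the frameworks in order.

    `free` maps slave name -> [cpus, mem_gb] and is updated in place.
    Whatever one framework leaves unused is offered to the next one.
    """
    for slave in list(free):
        for fw in frameworks:
            cpus, mem_gb = free[slave]
            if cpus <= 0 or mem_gb <= 0:
                break
            accepted = fw.on_offer(Offer(slave, cpus, mem_gb))
            if accepted is not None:
                free[slave] = [cpus - accepted[0], mem_gb - accepted[1]]
    return free


# The slave from the walkthrough above: 4 CPU and 8 GB free.
free = {"slave-1": [4, 8]}
fw_a = Framework("A", want_cpus=2, want_mem_gb=2)
fw_b = Framework("B", want_cpus=4, want_mem_gb=4)
offer_round(free, [fw_a, fw_b])
# free == {"slave-1": [2, 6]}: A took 2 CPU / 2 GB, B rejected the remainder
```

Framework B rejects the leftover offer because only 2 CPUs remain; in the next round, once A's task completes, the full slave would be offered again.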
This is one of the mechanisms frameworks use to achieve data locality: by accepting and/or rejecting resource offers from slaves based on data proximity.
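A rough sketch of how a framework might use accept/reject for locality follows. The function name choose_offers and the max_rejects fallback are assumptions made for this example (the fallback simply stops the framework from waiting forever); neither comes from Mesos itself.

```python
def choose_offers(offered_slaves, data_nodes, max_rejects=2):
    """Accept offers from slaves that hold our data and reject the rest.

    After `max_rejects` rejections the framework gives up on locality and
    accepts any slave (an assumed fallback, for illustration only).
    """
    accepted, rejects = [], 0
    for slave in offered_slaves:
        if slave in data_nodes or rejects >= max_rejects:
            accepted.append(slave)
        else:
            rejects += 1
    return accepted


# The input block lives only on s3.
picked = choose_offers(["s1", "s2", "s3", "s4"], data_nodes={"s3"})
# picked == ["s3", "s4"]: s1 and s2 rejected, s3 is local, s4 taken after 2 rejects
```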
So the main design philosophy of Mesos is to define a minimal interface that enables efficient resource sharing across frameworks, and otherwise push control of task scheduling and execution to the frameworks.
Allocation module
The allocation module is based on the assumption that tasks have a short life span, which means it can reallocate resources as quickly as tasks complete. However, in the case of long-running tasks, Mesos can revoke (kill) tasks by killing the task itself and/or killing the executor if the task does not respond. The specifics are an implementation detail of the allocation module's policies.
While MR-based frameworks can easily live with re-allocated/killed tasks, MPI-style frameworks can't, so allocation modules can expose a guaranteed allocation to each framework: basically a set of resources permanently held by the framework.
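One way to picture the interaction between revocation and guaranteed allocation is the sketch below. The pick_revocations function and its "largest excess first" ordering are assumptions for illustration; a real allocation module may choose victims quite differently.

```python
def pick_revocations(usage, guaranteed, needed):
    """Plan how many CPUs to revoke from each framework to free `needed` CPUs.

    Only usage above a framework's guaranteed allocation is revocable, so a
    framework that stays within its guarantee never loses tasks.
    """
    plan = {}
    # Assumed policy: revoke from the framework furthest above its guarantee first.
    by_excess = sorted(
        usage, key=lambda fw: usage[fw] - guaranteed.get(fw, 0), reverse=True
    )
    for fw in by_excess:
        if needed <= 0:
            break
        excess = max(usage[fw] - guaranteed.get(fw, 0), 0)
        take = min(excess, needed)
        if take > 0:
            plan[fw] = take
            needed -= take
    return plan


usage = {"mapreduce": 10, "mpi": 8}
guaranteed = {"mapreduce": 2, "mpi": 8}
plan = pick_revocations(usage, guaranteed, needed=5)
# plan == {"mapreduce": 5}: the MPI framework is within its guarantee and untouched
```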
Since resource acceptance is based on an accept or reject mechanism, rejection in itself may introduce unnecessary chatter, especially in cases where a framework would always reject offers from certain slaves. Mesos allows such chatter to be reduced further by letting frameworks provide allocation filters to the master, so that resources on a particular slave are not offered to the framework if they do not pass the filter criteria.
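As an illustration, filters can be thought of as Boolean predicates that the master evaluates before sending an offer. The helper names below (whitelist_filter, min_resources_filter, offers_for) are hypothetical and only sketch the idea; they are not part of any Mesos interface.

```python
def whitelist_filter(allowed_slaves):
    """Filter that only passes offers from the listed slaves."""
    allowed = set(allowed_slaves)
    return lambda slave, cpus, mem_gb: slave in allowed


def min_resources_filter(min_cpus, min_mem_gb):
    """Filter that only passes offers with at least this much free."""
    return lambda slave, cpus, mem_gb: cpus >= min_cpus and mem_gb >= min_mem_gb


def offers_for(filters, free):
    """Return the slaves whose free resources pass every filter of one framework."""
    return [
        slave
        for slave, (cpus, mem_gb) in free.items()
        if all(f(slave, cpus, mem_gb) for f in filters)
    ]


free = {"slave-1": (4, 8), "slave-2": (1, 2), "slave-3": (8, 16)}
filters = [whitelist_filter(["slave-1", "slave-2"]), min_resources_filter(2, 4)]
eligible = offers_for(filters, free)
# eligible == ["slave-1"]: slave-2 is too small, slave-3 is not whitelisted
```

The master evaluates the predicates once per allocation round, which replaces a round trip to the framework for every offer that would have been rejected anyway.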
Resource offers are rescinded if a framework does not respond to them in a timely manner, which allows the resources to be re-offered to other frameworks.
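The rescind step amounts to a timeout check over outstanding offers. The sketch below assumes a hypothetical rescind_expired helper and a timeout expressed in seconds; the actual timeout handling in Mesos is not described here.

```python
def rescind_expired(pending, now, timeout_s):
    """Split pending offers into those still open and those to rescind.

    `pending` maps offer id -> (framework, time the offer was sent).
    """
    still_open, rescinded = {}, []
    for offer_id, (framework, sent_at) in pending.items():
        if now - sent_at > timeout_s:
            rescinded.append(offer_id)  # resources return to the free pool
        else:
            still_open[offer_id] = (framework, sent_at)
    return still_open, rescinded


pending = {"o1": ("mpi", 0.0), "o2": ("hadoop", 4.0)}
still_open, rescinded = rescind_expired(pending, now=6.0, timeout_s=5.0)
# rescinded == ["o1"]; o2 is only 2 seconds old and stays open
```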
Performance Isolation
Performance isolation is achieved via native OS mechanisms (e.g., Linux containers).
Master fault tolerance (hot stand-by)
The master has a few standby nodes. In case of master failure, ZooKeeper elects a new master from the list. Since the master's state is soft state, the new master is able to reconstruct it from the slaves and framework schedulers, which connect to the newly elected master and repopulate its state.
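The recovery idea, rebuilding the master's view purely from what the cluster reports, can be sketched as a fold over status reports. The report tuple layout and the rebuild_state name are assumptions for this example, and leader election itself (handled by ZooKeeper) is left out.

```python
def rebuild_state(reports):
    """Rebuild a task table from (slave, framework, task_id, status) reports."""
    running = {}
    for slave, framework, task_id, status in reports:
        key = (framework, task_id)
        if status == "RUNNING":
            running[key] = slave
        else:
            # Any other status means the task no longer holds resources.
            running.pop(key, None)
    return running


reports = [
    ("slave-1", "hadoop", "t1", "RUNNING"),
    ("slave-2", "mpi", "t7", "RUNNING"),
    ("slave-1", "hadoop", "t1", "FINISHED"),
]
state = rebuild_state(reports)
# state == {("mpi", "t7"): "slave-2"}
```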
Opinion:
Mesos may have its benefits in heterogeneous clusters often seen in small and medium-sized businesses, where maintaining a domain-specific cluster is not cost-efficient due to low load and high idle time, so a single cluster platform can host multiple cluster computing frameworks that share resources between them, thus limiting the idle time.
However, in large-scale, high-load data clusters (e.g., Hadoop), having a native cluster platform such as YARN is more beneficial, based on the simple premise that specialized solutions are generally better than generalized ones.
Having said that, YARN is perfectly capable of hosting cluster computing frameworks which may have nothing to do with Hadoop altogether, thus allowing a Hadoop/YARN cluster to be used as a heterogeneous cluster compute platform.
Misrepresentation:
The Mesos architecture white paper (http://mesos.berkeley.edu/mesos_tech_report.pdf), on page 2, section 2, gives an example of what needs to happen to data sitting in Hadoop after MR processing if that data needs to be integrated with MPI. It describes the need to spin up a new MPI cluster and move terabytes of data out of the Hadoop cluster, while making the argument that MPI could be just another compute framework hosted in the same cluster.
That is actually not true, and these capabilities existed in Hadoop even before YARN/Tez, since MPI concepts could always be implemented as part of your MR job (although I'd admit it was not simple). However, with YARN and Tez it is relatively simple. In fact, Tez is more flexible when it comes to this sort of requirement, since it is a low-level DAG framework composed of simple Vertices ("processors", i.e. filters) connected by Edges (pipes), as in the standard pipes-and-filters architecture.
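For readers unfamiliar with pipes-and-filters, here is a toy illustration in plain Python. It does not use the Tez API; run_dag, the processor functions and the edge list are all made up for this sketch, which only shows processors (vertices) being run in dependency order with edges carrying outputs downstream.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(processors, edges, source_input):
    """Run each vertex once all of its upstream vertices have produced output.

    processors: vertex name -> function taking a list of upstream outputs
    edges: list of (upstream, downstream) vertex names
    Vertices with no upstream edge receive [source_input].
    """
    upstream = {name: [] for name in processors}
    for src, dst in edges:
        upstream[dst].append(src)
    outputs = {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = [outputs[u] for u in upstream[name]] or [source_input]
        outputs[name] = processors[name](inputs)
    return outputs


processors = {
    "tokenize": lambda ins: ins[0].split(),
    "count": lambda ins: len(ins[0]),
    "upper": lambda ins: [w.upper() for w in ins[0]],
}
edges = [("tokenize", "count"), ("tokenize", "upper")]
result = run_dag(processors, edges, "a b c")
# result["count"] == 3 and result["upper"] == ["A", "B", "C"]
```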