MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster

Phases Involved

Map Phase: Each worker node applies the "map()" function to the local data to generate intermediate key/value pairs and writes the output to a temporary local storage. A master node ensures that only one copy of redundant input data is processed.
Shuffle Phase: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
Reduce Phase: Worker nodes now process each group of output data, per key, in parallel.

YARN Scheduler

YARN (Yet Another Resource Negotiator) treats each server as collection of containers (CPU + Memory)

Components

Global - Resource manager
Per server - Node manager
Per application - Application master

Failures

Server failure
- Node manager heart beats to resource manager, If server fails RM lets all affected AM's know and AM's take action
- Node manager keeps track of each task running at its server, If task fails NM will restart it
- AM heart beats to RM, On failure RM restarts AM which then sync up with currently running tasks
Resource Manager failure : Use old checkpoints and brings up secondary RM

Slow servers

Keeps track of progress of each task and create backup of straggler task and consider it done when first replica of task is completed, Also called speculative execution

Locality

MR task tries to schedule task on machine that contain replica of corresponding input data

Map-Reduce

MapReduce

YARN Scheduler

Failures

Slow servers

Locality

results matching ""

No results matching ""