MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster

Phases Involved

  • Map Phase: Each worker node applies the "map()" function to the local data to generate intermediate key/value pairs and writes the output to a temporary local storage. A master node ensures that only one copy of redundant input data is processed.

  • Shuffle Phase: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.

  • Reduce Phase: Worker nodes now process each group of output data, per key, in parallel.

YARN Scheduler

YARN (Yet Another Resource Negotiator) treats each server as collection of containers (CPU + Memory)

Components

  • Global - Resource manager
  • Per server - Node manager
  • Per application - Application master

Failures

  • Server failure

    • Node manager heart beats to resource manager, If server fails RM lets all affected AM's know and AM's take action
    • Node manager keeps track of each task running at its server, If task fails NM will restart it
    • AM heart beats to RM, On failure RM restarts AM which then sync up with currently running tasks
  • Resource Manager failure : Use old checkpoints and brings up secondary RM

Slow servers

Keeps track of progress of each task and create backup of straggler task and consider it done when first replica of task is completed, Also called speculative execution

Locality

MR task tries to schedule task on machine that contain replica of corresponding input data

results matching ""

    No results matching ""