Hadoop Architecture
A rack is a collection of nodes that are physically housed together and connected to the same switch. Network bandwidth between two nodes in the same rack is therefore greater than bandwidth between two nodes in different racks.
A cluster is a collection of racks.
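The rack and cluster definitions above can be pictured as a tree (cluster, then rack, then node), where "closeness" is the number of hops between two nodes. The following is a minimal sketch of that idea, assuming hypothetical path strings such as /rack1/node1; the names and the function are illustrative, not Hadoop's own code.

```python
# Hypothetical topology paths in the form "/rack/node"; all names are made up.
def network_distance(a, b):
    """Count the tree hops from each node up to their closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


same_node = network_distance("/rack1/node1", "/rack1/node1")
same_rack = network_distance("/rack1/node1", "/rack1/node2")
cross_rack = network_distance("/rack1/node1", "/rack2/node3")
print(same_node, same_rack, cross_rack)  # 0 2 4
```

A smaller distance stands for higher available bandwidth, so a scheduler would prefer the same node, then the same rack, then another rack.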
Distributed file system
- HDFS (Hadoop Distributed File System)
- IBM Spectrum Scale (an alternative to HDFS)
MapReduce engine
- Framework for performing calculations on the data in the file system
- Has a built-in resource manager and scheduler
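To illustrate the kind of calculation the MapReduce engine performs, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. The function names and the input lines are assumptions made for illustration; a real job runs these phases in parallel across the cluster through the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


lines = ["hadoop stores data", "hadoop processes data"]  # made-up input
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```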
Two core components of HDFS
- One NameNode (master): holds the metadata about the stored data
- Multiple DataNodes (slaves): store the data itself
- Data is replicated across DataNodes according to a configurable replication factor
- HDFS stores files in blocks (the block is the storage unit of HDFS)
- Follows WORM (Write Once, Read Many)
- Instead of moving data to the processing, Hadoop moves the processing to the nodes that hold the data, which reduces network congestion
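The points above (blocks, replication, and moving processing to the data) can be sketched together. This is a simplified illustration under stated assumptions: a 128 MB block size, a replication factor of 3, hypothetical node names dn1 to dn4, and round-robin replica placement, which is a stand-in and not HDFS's actual rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB); configurable in practice
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block; only the last block may be smaller."""
    if file_size == 0:
        return []
    count = (file_size + block_size - 1) // block_size  # integer ceiling
    return [block_size] * (count - 1) + [file_size - block_size * (count - 1)]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Round-robin placement: a simplification, not HDFS's real policy."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [data_nodes[(block_id + i) % len(data_nodes)]
                   for i in range(replication)]
        for block_id in range(num_blocks)
    }


def pick_worker(block_id, placement, idle_nodes):
    """Prefer an idle node that already stores the block (processing goes to data)."""
    for node in placement[block_id]:
        if node in idle_nodes:
            return node
    return sorted(idle_nodes)[0]  # no local copy is idle: fall back to a remote read


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print([size // MB for size in blocks])            # [128, 128, 44]
print(placement[1])                               # ['dn2', 'dn3', 'dn4']
print(pick_worker(1, placement, {"dn1", "dn4"}))  # dn4
```

Because every block has several replicas, the scheduler usually finds an idle node with a local copy, and the block does not have to cross the network.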
NameNode
- NameNode keeps the entire metadata in RAM
- NameNode records every change to HDFS in a write-ahead log, called the journal, on its local file system
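The combination of in-memory metadata and a write-ahead journal can be sketched as a toy model. Everything here is hypothetical (the class TinyNameNode, the JSON record format, the file name journal.log); it only shows the principle that a change is written to the journal before the in-memory state is updated, so the state can be rebuilt after a restart.

```python
import json
import os
import tempfile


class TinyNameNode:
    """Toy model: metadata lives in a dict (RAM); changes go to a journal file first."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.namespace = {}  # path -> list of block ids, held in memory
        self._replay()

    def _replay(self):
        """Rebuild the in-memory namespace from the journal after a restart."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path) as journal:
            for line in journal:
                self._apply(json.loads(line))

    def _apply(self, record):
        """Apply one journal record to the in-memory namespace."""
        if record["op"] == "create":
            self.namespace[record["path"]] = record["blocks"]
        elif record["op"] == "delete":
            self.namespace.pop(record["path"], None)

    def _log(self, record):
        """Append to the journal and force it to disk before touching memory."""
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps(record) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self._apply(record)

    def create(self, path, blocks):
        self._log({"op": "create", "path": path, "blocks": blocks})

    def delete(self, path):
        self._log({"op": "delete", "path": path})


journal_file = os.path.join(tempfile.mkdtemp(), "journal.log")
nn = TinyNameNode(journal_file)
nn.create("/logs/a.txt", [1, 2])
nn.create("/logs/b.txt", [3])
nn.delete("/logs/a.txt")

restarted = TinyNameNode(journal_file)  # simulated restart: RAM is lost, journal survives
print(restarted.namespace)  # {'/logs/b.txt': [3]}
```

Keeping the namespace in RAM makes lookups fast, while the journal makes the changes durable.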