Course Outline
Introduction
- Hadoop history and core concepts
- Ecosystem overview
- Distributions
- High-level architecture
- Hadoop myths
- Hadoop challenges (hardware / software)
- Labs: Discuss your Big Data projects and challenges
Planning and installation
- Selecting software and Hadoop distributions
- Sizing the cluster and planning for growth
- Selecting hardware and network infrastructure
- Rack topology
- Installation procedures
- Multi-tenancy considerations
- Directory structure and logs
- Benchmarking
- Labs: Cluster installation and performance benchmarking
HDFS operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring
- Command-line and browser-based administration
- Adding storage and replacing defective drives
- Labs: Getting familiar with HDFS command lines
Data ingestion
- Using Flume for logs and other data ingestion into HDFS
- Utilizing Sqoop to import from SQL databases to HDFS and export back to SQL
- Hadoop data warehousing with Hive
- Copying data between clusters (distcp)
- Using S3 as a complement to HDFS
- Data ingestion best practices and architectures
- Labs: Setting up and using Flume and Sqoop
MapReduce operations and administration
- Parallel computing before MapReduce: comparing HPC and Hadoop administration
- MapReduce cluster loads
- Nodes and daemons (JobTracker, TaskTracker)
- MapReduce UI walk-through
- MapReduce configuration
- Job configuration
- Optimizing MapReduce
- Fool-proofing MR: key points to communicate to programmers
- Labs: Running MapReduce examples
YARN: new architecture and new capabilities
- YARN design goals and implementation architecture
- New actors: ResourceManager, NodeManager, ApplicationMaster
- Installing YARN
- Job scheduling under YARN
- Labs: Investigating job scheduling
Advanced topics
- Hardware monitoring
- Cluster monitoring
- Adding and removing servers, upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop high availability (HA)
- Hadoop Federation
- Securing your cluster with Kerberos
- Labs: Setting up monitoring
Optional tracks
- Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5)
- Ambari for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)
Requirements
- Comfortable with basic Linux system administration
- Basic scripting skills
Prior knowledge of Hadoop and distributed computing is not required; both topics are introduced and explained during the course.
Lab environment
Zero install: students do not need to install Hadoop software on their own machines. A working Hadoop cluster will be provided.
Students will need the following:
- An SSH client (Linux and Mac users already have SSH clients; PuTTY is recommended for Windows)
- A browser to access the cluster. We recommend Firefox with the FoxyProxy extension installed
21 Hours
Testimonials (1)
Hands-on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already.