Course Outline
Introduction:
- Apache Spark within the Hadoop Ecosystem
- Brief overview of Python and Scala
Core Concepts (Theory):
- Architecture
- Resilient Distributed Datasets (RDD)
- Transformations and Actions
- Stages, Tasks, and Dependencies
Practical Workshop: Basics in the Databricks Environment
- RDD API exercises
- Core transformation and action functions
- PairRDDs
- Join operations
- Effective caching strategies
- DataFrame API exercises
- Spark SQL
- DataFrame operations: select, filter, group, and sort
- User-Defined Functions (UDFs)
- Exploration of the Dataset API
- Streaming capabilities
Practical Workshop: Deployment in the AWS Environment
- Foundations of AWS Glue
- Comparing AWS EMR and AWS Glue
- Sample jobs in both environments
- Evaluating advantages and disadvantages
Supplementary Content:
- Introduction to Apache Airflow orchestration
Requirements
Programming proficiency (preferably in Python or Scala)
Foundational knowledge of SQL
21 Hours
Testimonials (3)
Having hands-on sessions and assignments
Poornima Chenthamarakshan - Intelligent Medical Objects
Course - Apache Spark in the Cloud
1. Right balance between high-level concepts and technical details. 2. Andras is very knowledgeable about his teaching. 3. Exercise
Steven Wu - Intelligent Medical Objects
Course - Apache Spark in the Cloud
Got to learn Spark Streaming, Databricks, and AWS Redshift