Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief overview of Python and Scala

Core Concepts (Theory):

  • Architecture
  • Resilient Distributed Datasets (RDDs)
  • Transformations and Actions (see the sketch after this list)
  • Stages, Tasks, and Dependencies
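
The core-concepts section hinges on the difference between lazy transformations and eager actions. The following is a minimal PySpark sketch of that behavior, not course material; it assumes a local Spark installation, and the names and values are purely illustrative.

```python
# Transformations vs. actions: transformations only record lineage;
# actions trigger execution as stages and tasks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))          # create an RDD

# Transformations are lazy: nothing runs yet, Spark only records
# the lineage (a DAG of dependencies).
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions trigger execution: the lineage is split into stages at
# shuffle boundaries and run as tasks on the executors.
print(squares.collect())   # [4, 16, 36, 64, 100]
print(squares.sum())       # 220

spark.stop()
```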

Practical Workshop: Basics in the Databricks Environment

  • RDD API exercises (illustrated in the sketch after this list):
    • Core transformation and action functions
    • PairRDDs
    • Join operations
    • Effective caching strategies
  • DataFrame API exercises (also covered in the sketch below):
    • Spark SQL
    • DataFrame operations: select, filter, group, and sort
    • User-Defined Functions (UDFs)
  • Exploration of the Dataset API
  • Streaming capabilities
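
As a taste of these exercises, here is a minimal PySpark sketch running through pair RDDs, a join, caching, DataFrame operations, Spark SQL, and a UDF. All dataset names and values are invented for illustration; they are not the actual course exercises.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("workshop-basics").getOrCreate()
sc = spark.sparkContext

# Pair RDDs: reduceByKey aggregates per key, join matches keys.
sales = sc.parallelize([("apples", 3), ("pears", 2), ("apples", 5)])
prices = sc.parallelize([("apples", 1.20), ("pears", 0.80)])

totals = sales.reduceByKey(lambda a, b: a + b)   # ("apples", 8), ("pears", 2)
joined = totals.join(prices)                     # key -> (quantity, price)

# Caching keeps the joined RDD in memory for reuse by both actions below.
joined.cache()
print(joined.mapValues(lambda qp: qp[0] * qp[1]).collect())
print(joined.count())

# DataFrames: select, filter, group, and sort.
df = spark.createDataFrame(
    [("Alice", "sales", 3000), ("Bob", "sales", 4000), ("Cara", "hr", 3500)],
    ["name", "dept", "salary"],
)
(df.select("dept", "salary")
   .filter(F.col("salary") > 3000)
   .groupBy("dept")
   .agg(F.avg("salary").alias("avg_salary"))
   .orderBy("dept")
   .show())

# Spark SQL over the same data via a temporary view.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()

# A user-defined function (UDF) applied as a new column.
shout = F.udf(lambda s: s.upper(), StringType())
df.withColumn("name_upper", shout("name")).show()

spark.stop()
```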

Practical Workshop: Deployment in the AWS Environment

  • Foundations of AWS Glue
  • Comparing AWS EMR and AWS Glue
  • Sample jobs in both environments (a minimal job sketch follows this list)
  • Evaluating the advantages and disadvantages of each service
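
For a sense of what a deployed job looks like, here is a minimal PySpark batch script of the kind compared in this workshop. The S3 bucket and paths are hypothetical placeholders; the same script can be submitted to an EMR cluster with spark-submit, while on Glue it would typically be wrapped in the GlueContext and Job boilerplate that the service generates.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-etl").getOrCreate()

# Read raw CSV data from S3 (hypothetical bucket and path).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://example-bucket/raw/orders/"))

# Aggregate order amounts per day.
daily = (orders
         .groupBy("order_date")
         .agg(F.sum("amount").alias("total_amount")))

# Write the result back to S3 as Parquet (hypothetical path).
daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_orders/")

spark.stop()
```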

Supplementary Content:

  • Introduction to Apache Airflow orchestration (see the DAG sketch below)
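
As a preview of the orchestration topic, here is a minimal Airflow DAG sketch (assuming Airflow 2.x) that schedules a Spark job once a day. The DAG id, task, and script path are hypothetical placeholders, not part of the course environment.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_spark_etl",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        # Hypothetical path to the Spark job script on the cluster.
        bash_command="spark-submit /opt/jobs/daily_orders_etl.py",
    )
```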

Requirements

  • Programming proficiency (preferably in Python or Scala)
  • Foundational knowledge of SQL

Duration

21 hours
