Course Outline
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Surveying the Big Data ecosystem and Spark's role in modern data platforms
- Understanding Spark architecture: drivers, executors, cluster managers, lazy evaluation, DAGs, and execution planning
- Distinguishing between RDD and DataFrame APIs and determining when to use each
- Creating and configuring a SparkSession and understanding the fundamentals of application configuration
Module 2: PySpark DataFrames
- Reading and writing data across enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Manipulating PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
- Performing advanced operations: window functions, timestamp handling, and nested data structures
- Implementing data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Grasping performance fundamentals: partitioning strategies, shuffle behavior, caching, and persistence
- Applying optimization techniques such as broadcast joins and execution plan analysis
- Efficiently processing large datasets and adhering to best practices for scalable data workflows
- Understanding schema evolution and the modern storage formats common in enterprise environments
Module 4: Feature Engineering at Scale
- Conducting feature engineering with Spark MLlib: managing missing values, encoding categorical variables, and scaling features
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introducing feature selection and strategies for handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results within distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Constructing end-to-end Machine Learning pipelines that integrate preprocessing, feature engineering, and modeling
- Implementing train/validation/test split strategies
- Executing cross-validation and hyperparameter tuning via grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying suitable evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making informed decisions regarding model selection
- Interpreting feature importance and gaining insights into model behavior
Module 8: Production & Enterprise Practices
- Persisting and loading models within Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise contexts
- Introducing versioning, experiment tracking concepts, and fundamental testing strategies
Practical Outcome
- Competence in working autonomously with PySpark
- Proficiency in processing large datasets efficiently
- Capability to perform feature engineering at scale
- Skill in building scalable Machine Learning pipelines
Requirements
Participants are expected to have the following background knowledge:
- Fundamental Python programming skills, including experience with functions, data structures, and libraries
- A solid grasp of data analysis concepts such as datasets, transformations, and aggregations
- Basic familiarity with SQL and relational data structures
- Introductory knowledge of Machine Learning principles, including training datasets, features, and evaluation metrics
- While not mandatory, familiarity with command line environments and basic software development practices is recommended
- Prior experience with Pandas, NumPy, or comparable data processing libraries is advantageous but not required
Testimonials
I liked that it was practical. Loved to apply the theoretical knowledge with practical examples.