Alberto Castellini
STATISTICAL LEARNING (2022/2023) (official webpage)
Master's degree in Data Science, University of Verona


Syllabus

Theory:
  • Linear models for regression
  • Cross-validation
  • Variable and model selection in linear regression models
  • Regularization for linear regression models
  • Methods for dimensionality reduction
  • Classification models (Logistic Regression, Linear Discriminant Analysis)
  • Tree-based methods (Decision Trees, Bagging, Random Forest, Boosting)
  • Unsupervised methods (Principal Component Analysis, K-Means Clustering, Hierarchical Clustering)
  • Introduction to Neural Networks (single-layer neural networks, training a neural network)

Laboratory:
  • Introduction to data analysis with Python
  • Linear regression (Python)
  • Variable and model selection in linear models (Python)
  • Ridge and Lasso regularization for linear regression models (Python)
  • Classification with logistic regression (Python)
  • Data clustering with k-means and hierarchical approaches (Python)
  • Artificial Neural Networks (Python)
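To give a flavour of the laboratory sessions, the sketch below shows a lab-style linear regression workflow in Python with scikit-learn. It uses synthetic data and illustrative variable names rather than the course datasets, and is only an indicative example of the kind of code covered in the labs.

```python
# Minimal sketch of a lab-style linear regression workflow
# (synthetic data, not the course datasets; assumes numpy and scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # three synthetic predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)     # ordinary least squares fit
print("coefficients:", ols.coef_)
print("test MSE:", mean_squared_error(y_test, ols.predict(X_test)))

# 5-fold cross-validation, as covered in the theory part of the course
print("mean CV R^2:", cross_val_score(LinearRegression(), X, y, cv=5).mean())
```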

Learning outcomes

The course aims to introduce students to the statistical models used in data science. The foundations of statistical learning (supervised and unsupervised) are developed with emphasis on the mathematical basis of the different state-of-the-art methodologies. The course also provides rigorous derivations of the methods currently used in industrial and scientific applications, so that students understand the requirements for their correct use. Laboratory sessions illustrate the fundamental algorithms on industrial case studies, in which students learn to analyze real datasets using Python.

At the end of the course students must demonstrate the following skills:
  • knowledge of the main stages of data preparation, model creation and evaluation;
  • ability to develop solutions for feature selection;
  • knowledge and ability to use the main regression and regularization models (e.g., LASSO, Ridge Regression);
  • knowledge and ability to use the main methods for dimensionality reduction (e.g., Principal Component Regression, Partial Least Squares);
  • knowledge and ability to use the main methods for classification (e.g., KNN, Logistic Regression, LDA);
  • knowledge and ability to use the main methods for tree-based regression and classification (e.g., decision trees, random forests);
  • knowledge and ability to use the main methods for unsupervised data analysis (e.g., K-means clustering, hierarchical clustering).
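As an illustration of the regularization models listed above, the sketch below fits Ridge and LASSO regression in Python with scikit-learn. The data are synthetic and the regularization strength alpha is arbitrary; this is an indicative example, not the lab material.

```python
# Illustrative sketch of Ridge and Lasso regularization with scikit-learn
# (synthetic data; the regularization strength alpha is chosen arbitrarily).
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))                     # ten predictors, only two informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

# Lasso tends to drive irrelevant coefficients exactly to zero (variable selection),
# while ridge only shrinks them towards zero.
print("ridge coefficients:", ridge[-1].coef_.round(2))
print("lasso coefficients:", lasso[-1].coef_.round(2))
```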

Reference books

  • G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer, 2021. (pdf)
  • T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer, 2009. (pdf)

Theory: coming soon.


Lab:
Slides
  • Introduction to data analysis with Python and R in Kaggle (pdf)
  • Linear methods for regression (pdf)
  • Variable Subset Selection (pdf)
  • Shrinkage (regularization) methods for variable selection (pdf)
  • Unsupervised learning: clustering analysis (pdf)
  • Artificial neural networks - Prediction of house value: California housing dataset (pdf)
Exercises
  • Exercise 1 (Part 1): Telco Customer Churn first data analysis using Python (pdf)
  • Exercise 1 (Part 2): Telco Customer Churn first data analysis using Python (pdf)
  • Exercise 2: Prediction on the prostate cancer dataset using OLS regression in Python (pdf)
  • Exercise 3: Variable subset selection with OLS regression on the prostate cancer dataset in Python (pdf)
  • Exercise 4: Shrinkage (regularization) methods with regression on the prostate cancer dataset in Python (pdf)
  • Exercise 5: Clustering on the human tumor dataset in Python (pdf)
  • Exercise 6: Artificial neural networks - Prediction of house value: California housing dataset (pdf - see last slide)
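As an indication of the kind of workflow Exercise 6 asks for, the sketch below trains a neural-network regressor on the California housing dataset. It assumes scikit-learn's MLPRegressor with illustrative hyperparameters and is not the official solution to the exercise.

```python
# Minimal sketch of neural-network regression on the California housing dataset
# (dataset fetched via scikit-learn on first use; hyperparameters are illustrative).
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling the inputs matters for gradient-based training of the network.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```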
Other material
  • Python 3 tutorial (pdf)