Description
Foundations of Statistics for Data Science:
- Understanding the properties of an attribute: Central tendencies (Mean, Median, Mode); Measures of spread (Range, Variance, Standard Deviation); Basics of Probability Distributions; Expectation and Variance of a variable
- Introduction to random variables, probability theory, conditional probability, and one of the most powerful results in probability theory – Bayes' theorem.
- Discrete probability distributions: Bernoulli, Binomial, Geometric, and Poisson, with the properties of each.
- Continuous probability distributions: Exponential, Normal (with special emphasis), and t-distribution.
- How to conduct statistical hypothesis testing, with detailed coverage of the chi-square test, t-test, z-test, F-test, and ANOVA.
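As a taste of the hands-on work, here is a minimal Python sketch (the sample values are made up for illustration; NumPy and SciPy are assumed) computing the central tendencies and spread of an attribute and running a one-sample t-test:

    import numpy as np
    from scipy import stats

    # Hypothetical attribute values (illustrative only)
    x = np.array([12.0, 9.0, 11.0, 10.0, 11.0, 12.0, 9.0, 11.0])

    # Central tendencies
    print("mean:  ", x.mean())
    print("median:", np.median(x))
    print("mode:  ", stats.mode(x, keepdims=False).mode)

    # Measures of spread
    print("range:   ", x.max() - x.min())
    print("variance:", x.var(ddof=1))  # sample variance (n - 1 denominator)
    print("std dev: ", x.std(ddof=1))

    # One-sample t-test of H0: population mean = 10
    t_stat, p_value = stats.ttest_1samp(x, popmean=10)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")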
R & Python:
- R and Python basics, understanding data structures, functions, control structures, data manipulations, date and string manipulations, etc.
- Pre-processing techniques: binning, filling missing values, standardization & normalization, type conversions, train-test data split, ROCR
- Hands-on implementation of all the pre-processing techniques in R and Python.
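A minimal Python sketch of a few of these pre-processing steps, assuming pandas and scikit-learn are installed (the dataset and column names are hypothetical):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Hypothetical dataset (illustrative only)
    df = pd.DataFrame({"age": [25, None, 40, 35, 29],
                       "income": [40000, 52000, None, 61000, 45000],
                       "label": [0, 1, 0, 1, 0]})

    # Fill missing values with each column's median
    df = df.fillna(df.median(numeric_only=True))

    # Binning: discretize a continuous attribute into three categories
    df["age_bin"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])

    # Train-test split
    X, y = df[["age", "income"]], df["label"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Standardization: fit on train only, then apply to both splits
    scaler = StandardScaler().fit(X_train)
    X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)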
Machine Learning Models:
KNN Model:
- Computational geometry; Voronoi Diagrams; Delaunay Triangulations
- K-Nearest Neighbor algorithm; Wilson editing and triangulations
- Aspects to consider while designing K-Nearest Neighbor
- Hands-on example of K-Nearest Neighbor using R (a Python counterpart is sketched after this list)
- Collaborative filtering and its application areas
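While the course labs use R, an equivalent K-Nearest Neighbor workflow in Python might look like this sketch (scikit-learn's built-in iris data stands in for the course dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k and the distance metric are the key design choices for KNN
    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))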
SVM:
- Support Vector Machines (SVM) are among the most elegant techniques developed in the last two decades. You will learn this extremely powerful, cutting-edge technique in this session.
- Linear learning machines and kernel space; making kernels and working in feature space
- Demonstration of SVM on classification and regression problems using a business case in R.
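The in-class demonstration is in R; a Python equivalent using scikit-learn (the RBF kernel and built-in data are chosen purely for illustration) could look like:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A kernel SVM learns a linear separator in an implicit feature space
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))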
Decision Trees:
In machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than any single algorithm could achieve on its own. The basics of ensembling, bagging & boosting, are covered in detail, and the session then progresses to machine learning methods that use either or both approaches to build ensemble models.
- Bagging & boosting and their impact on bias and variance
- Random forest
- Gradient Boosting Machines (GBM) and XGBoost, a popular winning recipe in data science competitions (a sketch contrasting bagging and boosting follows this list)
- Architecting ML solutions
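A minimal Python sketch contrasting a bagging ensemble (random forest) with a boosting ensemble (gradient boosting), with scikit-learn's built-in data standing in for a real business dataset:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Bagging: deep trees grown on bootstrap samples and averaged, which mainly reduces variance
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # Boosting: shallow trees fit sequentially to the previous errors, which mainly reduces bias
    gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0).fit(X_train, y_train)

    print("random forest:    ", rf.score(X_test, y_test))
    print("gradient boosting:", gbm.score(X_test, y_test))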
Clustering:
You will learn the most commonly used family of unsupervised learning algorithms – clustering.
- Different clustering methods; review of several distance measures
- Iterative distance-based clustering (K-Means)
- Dealing with continuous and categorical values in K-Means
- Constructing hierarchical clusters; K-Medoids, K-Modes, and density-based clustering to handle different data types in practice
- Testing the stability of clusters
- Hands-on implementation of each of these methods will be conducted in R; a Python sketch of K-Means follows this list.
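A minimal Python counterpart to the R labs, sketching iterative distance-based clustering (K-Means) with a quick stability-style check via the silhouette score (scikit-learn assumed; the data is synthetic):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Synthetic data standing in for the course dataset
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
    print("silhouette:   ", silhouette_score(X, km.labels_))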
Business Case Analysis:
- The objective of this session is to provide an applied, end-to-end view of solving a Data Science problem and defending your analysis.
- We provide a business case in advance, to which you will be required to apply all the data pre-processing steps and prepare the input for one or more of the ML algorithms learnt thus far.
- The lab is designed so that everyone participates in the discussion, designs the solution approach for the given business case, and defends the analysis approach.
Text Mining:
- Introduction to the fundamentals of information retrieval
- TF-IDF (a Python sketch combining TF-IDF, the vector space model, and LSA follows this list)
- Thinking about the math behind text; Properties of words; Vector Space Model
- Matrix factorization: SVD
- Text Indexing
- Inverted Indexes (a pure-Python sketch follows this list)
- Boolean query processing
- Handling phrase queries and proximity queries
- Latent Semantic Analysis (LSA)
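A minimal Python sketch tying several of these ideas together: documents are mapped into the vector space model with TF-IDF weights, and a truncated SVD projects them into a low-rank latent semantic space (LSA). scikit-learn is assumed and the toy corpus is made up:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "stock markets fell sharply today",
            "investors sold stocks as markets fell"]

    # Vector space model with TF-IDF term weighting
    X = TfidfVectorizer().fit_transform(docs)

    # Latent Semantic Analysis: truncated SVD of the term-document matrix
    lsa = TruncatedSVD(n_components=2, random_state=0)
    print(lsa.fit_transform(X))  # each row is a document in the 2-D latent space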
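And a small pure-Python sketch of an inverted index with Boolean AND query processing (toy documents; the helper name boolean_and is hypothetical):

    from collections import defaultdict

    docs = {1: "the cat sat on the mat",
            2: "the dog chased the cat",
            3: "stock markets fell sharply today"}

    # Build the inverted index: term -> set of document IDs (postings)
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def boolean_and(*terms):
        """Boolean AND query: intersect the postings of every term."""
        postings = [index.get(t, set()) for t in terms]
        return sorted(set.intersection(*postings))

    print(boolean_and("the", "cat"))  # -> [1, 2]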