Description
Course Outline:
1. Introduction to Data Science
- CRISP-DM Framework
- Technology stack for Data Science
2. RDBMS (Oracle) with SQL
- SQL Introduction (DDL, DML)
- Joins
- Views, Triggers and Procedures
- Advanced SQL for Analytics
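A minimal sketch of the join and analytic-SQL ideas above, run from Python against an in-memory SQLite database rather than Oracle so it is self-contained (window functions need SQLite 3.25+); the table names and rows are invented for illustration.

```python
import sqlite3

# In-memory database so the example needs no server; the course itself uses Oracle.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: two related tables.
cur.executescript("""
CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employees   (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                          salary REAL, dept_id INTEGER REFERENCES departments(dept_id));
""")

# DML: insert sample rows.
cur.executemany("INSERT INTO departments VALUES (?, ?)",
                [(1, "Sales"), (2, "Engineering")])
cur.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                [(10, "Asha", 52000, 1), (11, "Ravi", 61000, 2), (12, "Meera", 58000, 2)])
conn.commit()

# A join plus an analytic (window) function: rank employees by salary within each department.
query = """
SELECT e.emp_name, d.dept_name, e.salary,
       RANK() OVER (PARTITION BY d.dept_id ORDER BY e.salary DESC) AS salary_rank
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
ORDER BY d.dept_name, salary_rank;
"""
for row in cur.execute(query):
    print(row)

conn.close()
```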
3. Python Programming
- Variables and data types
- Standard I/O
- Operators
- Control flow (if/else, for, while, break and continue)
- Data Structures (lists, tuples, sets, dictionaries and strings)
- Functions (recursive functions, lambda functions, map, filter and reduce)
- Modules and Packages
- Working with Python Libraries (os, datetime, sys)
- Exception Handling
- Object-Oriented Programming (classes, objects, OOP concepts)
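A compact sketch touching several of the topics above: control flow, data structures, lambda/map/filter/reduce, exception handling and a small class. The Student example and its data are made up for illustration.

```python
from functools import reduce

# A small class to show OOP basics: attributes, a method, and a __repr__.
class Student:
    def __init__(self, name, scores):
        self.name = name
        self.scores = scores          # list of exam scores

    def average(self):
        return sum(self.scores) / len(self.scores)

    def __repr__(self):
        return f"Student({self.name!r}, avg={self.average():.1f})"

def grade(avg):
    # Control flow: if / elif / else.
    if avg >= 80:
        return "A"
    elif avg >= 60:
        return "B"
    else:
        return "C"

students = [Student("Asha", [82, 91, 78]), Student("Ravi", [55, 64, 70])]

# Functional tools: map, filter, reduce, and lambdas.
averages = list(map(lambda s: s.average(), students))
passing  = list(filter(lambda a: a >= 60, averages))
total    = reduce(lambda x, y: x + y, averages)

for s in students:
    print(s, "grade:", grade(s.average()))

# Exception handling: guard against an empty score list.
try:
    Student("Empty", []).average()
except ZeroDivisionError as err:
    print("caught:", err)
```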
4. Exploratory Data Analysis
- Basic statistics
- Hypothesis testing
- Data distributions (Central Limit Theorem)
- Introduction to visualization
- Plotting with Matplotlib and seaborn
- Introduction to Tableau for Reporting
- Percentiles and Quartiles
- IQR, box-plot and whiskers
- Bar charts, pie charts, line charts and pair plots
- Univariate, bivariate and multivariate analysis
- EDA case study
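A minimal EDA sketch under the same headings: percentiles, quartiles and the IQR, a one-sample t-test, and a box plot plus histogram with Matplotlib/Seaborn. It uses a synthetic NumPy sample rather than real course data and assumes numpy, scipy, matplotlib and seaborn are installed.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)   # synthetic, roughly normal sample

# Percentiles, quartiles and the IQR.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"Q1={q1:.1f}  Q3={q3:.1f}  IQR={iqr:.1f}")

# Hypothesis test: is the sample mean plausibly 50?
t_stat, p_value = stats.ttest_1samp(data, popmean=50)
print(f"t={t_stat:.2f}  p={p_value:.3f}")

# Box plot (whiskers at 1.5 * IQR by default) and a histogram with a KDE overlay.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=data, ax=axes[0])
sns.histplot(data, kde=True, ax=axes[1])
plt.tight_layout()
plt.show()
```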
5. Python for Data Science
- Introduction to NumPy and operations on NumPy
- Getting started with Pandas and operations on Pandas
- Sampling techniques
- Data Preprocessing with Pandas (Excel, CSV and PDF)
- Missing value analysis (null value treatment)
- Data Normalization and standardization
- Outlier analysis and treatment
- Web scraping using BeautifulSoup; word clouds
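A short Pandas sketch of the preprocessing steps listed above (missing-value treatment, standardization and normalization, IQR-based outlier flagging) on a tiny invented DataFrame.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a missing value and an obvious outlier.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 95],
    "income": [42000, 58000, 51000, 60000, np.nan, 300000],
})

# Missing value treatment: fill numeric NaNs with the column median.
df_filled = df.fillna(df.median(numeric_only=True))

# Standardization (z-score) and min-max normalization.
standardized = (df_filled - df_filled.mean()) / df_filled.std()
normalized   = (df_filled - df_filled.min()) / (df_filled.max() - df_filled.min())

# Outlier flagging with the 1.5 * IQR rule on income.
q1, q3 = df_filled["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df_filled[(df_filled["income"] < q1 - 1.5 * iqr) |
                     (df_filled["income"] > q3 + 1.5 * iqr)]

print(df_filled)
print(standardized.round(2))
print("outliers:\n", outliers)
```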
6. Machine Learning with Python
a) Linear Regression:
- Algebra for regression
- Assumptions of Linear regression
- Multiple regression
- Feature Selection (VIF and p-values)
- Model building
- Parameter tuning for regression
- Model validation (accuracy, variance, R-squared)
- Bias-variance tradeoff
- Case study on regression
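An illustrative regression sketch, assuming scikit-learn and statsmodels as the tooling and synthetic data from make_regression rather than the course's case-study data: a VIF check for multicollinearity, a train/test split, and R-squared validation.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data with three predictors (a stand-in for a real dataset).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X = pd.DataFrame(X, columns=["x1", "x2", "x3"])

# Multicollinearity check: VIF per predictor (a constant column added, as statsmodels expects).
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])],
    index=X.columns,
)
print("VIF:\n", vif.round(2))

# Fit on a training split and validate on held-out data with R-squared.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test R-squared:", round(r2_score(y_test, model.predict(X_test)), 3))
print("coefficients:", model.coef_.round(2), "intercept:", round(float(model.intercept_), 2))
```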
b) Logistic Regression:
- Logistic regression intuition
- Sigmoid function, mathematics behind logistic regression
- Feature engineering and collinearity
- Regularization (L1 and L2) and parameter tuning
- Case study on logistic regression
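A brief sketch of regularized logistic regression, again assuming scikit-learn and a synthetic dataset: a grid search over the L1/L2 penalty and the regularization strength C, followed by accuracy on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Pipeline: scale features, then logistic regression (sigmoid output thresholded at 0.5).
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Parameter tuning: search over L1 vs L2 penalties and the regularization strength C.
grid = GridSearchCV(pipe, {"clf__penalty": ["l1", "l2"], "clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", round(accuracy_score(y_test, grid.predict(X_test)), 3))
```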
c) Decision Trees:
- Decision trees introduction
- Homogeneity, Gini index and information gain
- Building decision trees and parameter tuning
- Truncating and Pruning trees
- Random forest (ensembles)
- Out-of-bag (OOB) error
- Cross-validation, bagging and boosting (XGBoost, AdaBoost and GBM)
- Case study on decision tree and random forest
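A short sketch of the tree-based ideas above on synthetic data, assuming scikit-learn: a pre-pruned decision tree scored with cross-validation, a random forest with its out-of-bag accuracy, and scikit-learn's gradient boosting standing in for XGBoost/AdaBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=6, random_state=0)

# A single pruned tree: max_depth and min_samples_leaf act as truncation / pre-pruning controls.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, min_samples_leaf=10, random_state=0)
print("tree CV accuracy:", round(cross_val_score(tree, X, y, cv=5).mean(), 3))

# Random forest (bagged ensemble) with the out-of-bag score as a built-in validation estimate.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", round(forest.oob_score_, 3))

# A boosting counterpart (scikit-learn's GBM; XGBoost would be used in much the same way).
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
print("GBM CV accuracy:", round(cross_val_score(gbm, X, y, cv=5).mean(), 3))
```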
d) K-nearest neighbors (KNN) for classification
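A minimal KNN sketch, assuming scikit-learn and its built-in Iris dataset; features are scaled first because KNN is distance based.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k = 5 neighbors vote on the class of each test point.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)
print("test accuracy:", round(knn.score(X_test, y_test), 3))
```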
e) Model deployment with PMML, HDF5 (.h5) and pickle
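A small deployment sketch for the pickle route named above; the model, file name and data are illustrative only. PMML export (for example via sklearn2pmml) and Keras .h5 files follow the same save-and-load pattern.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model, persist it to disk, then load it back for inference.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print("prediction for first sample:", restored.predict(X[:1]))
```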