Description
This course is designed to make the participant proficient in Hadoop and Map-Reduce through hands-on practice. It also covers the major Hadoop Eco-System components (Hive, Pig, Sqoop, Flume).
Pre-Requisites:
The pre-requisites for the Hadoop Development course are:
- Core Java with collections
- Linux commands
- SQL
Module 1: Big Data Concepts
- Understand big data, its challenges, and the distributed environment.
- Be aware of Hadoop and its sub-projects.
- Introduction
- Data
- Storage
- Big Data
- Distributed environment
- Hadoop introduction
- History
- Environment
- Benefits
- Hadoop Components / Eco-Systems
- Cluster Deployment
- Pseudo-Distributed vs. Fully Distributed
- Arranging a cluster for practice
- Cloudera cluster environment
Module 2: HDFS
- Understand the HDFS components: NameNode and DataNode
- Be aware of how data is stored and maintained in the cluster, and how data is read from and written to the cluster
- Able to maintain files in HDFS
- Able to access data in HDFS from a Java program (a short example follows the topic list below)
- HDFS Architecture
- NameNode
- DataNode
- Fault Tolerance
- Read & Write operations
- Interfaces (command line interface, JSP, API)
- HDFS Shell
- FS Shell Commands
- Uploading & Downloading
- Directory Handling
- File Handling
- Use Cases
- Using Hue for browsing data
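As a taste of the hands-on work, a few common FS shell commands look like this (the paths are hypothetical and only illustrate the syntax):

    hdfs dfs -mkdir /user/training/demo
    hdfs dfs -put access.log /user/training/demo/
    hdfs dfs -ls /user/training/demo
    hdfs dfs -get /user/training/demo/access.log ./copy_of_access.log

And a minimal sketch of reading a file from HDFS through the Java API; the class name and file path are placeholders, and it assumes the Hadoop client libraries and cluster configuration are on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // Connect to the default file system defined in the cluster configuration
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; replace with a file that exists on your HDFS
            Path path = new Path("/user/training/demo/access.log");
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);   // print each line of the HDFS file
                }
            }
            fs.close();
        }
    }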
Module 3: Map-Reduce (for Java Programmers)
- Understand the Map-Reduce paradigm and YARN architecture
- Analyze a given problem in the Map-Reduce pattern
- Able to implement Map-Reduce applications (a minimal word-count sketch follows the topic list below)
- Map-Reduce Introduction
- Map-Reduce Architecture
- Work Flow of an MR Program
- Placement of components on the cluster
- MR on HDFS
- YARN Architecture
- Designing an application on MR
- Implementation
- Detailed description of M-R methods
- Key/value pairs
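For orientation, the classic word-count example below sketches the shape of a Map-Reduce application written against the org.apache.hadoop.mapreduce API; the class name is illustrative, and input/output paths are taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output paths are passed on the command line, e.g. via "yarn jar"
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }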
Module 4: Data Ingestion
- Understand data ingestion and its types
- Recognize various Data Ingestion tools
- Hive Architecture
- Introduction
- Types of Data Ingestion
- Ingesting Batch Data
- Ingesting Streaming Data
- Use Cases
Module 5: Apache Sqoop
- Understand Sqoop architecture and uses
- Able to load data from an RDBMS table or query onto HDFS
- Able to write Sqoop scripts for exporting data from HDFS onto RDBMS tables (sample commands follow the topic list below)
- Introduction
- Sqoop Architecture
- Connect to a MySQL database
- Sqoop Import
- Importing to a specific location
- Querying with import
- Sqoop import-all
- Integrating with Hive
- Export
- Eval
- Joins
- Use Cases
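A representative Sqoop import and export might look like the following; the connection string, credentials, database, table names, and HDFS paths are all hypothetical:

    sqoop import \
      --connect jdbc:mysql://dbhost/retail_db \
      --username training --password training \
      --table customers \
      --target-dir /user/training/customers \
      --num-mappers 1

    sqoop export \
      --connect jdbc:mysql://dbhost/retail_db \
      --username training --password training \
      --table customer_summary \
      --export-dir /user/training/customer_summary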
Module 6: Apache Flume
- Understand Flume architecture and uses
- Able to create Flume configuration files to stream and ingest data onto HDFS (a sample agent configuration follows the topic list below)
- Introduction
- Flume Architecture
- Flume Master
- Flume Agents
- Flume Collectors
- Creation of Flume configuration files
- Streaming from local disk
- Streaming web / social networking data
- Examples
- Use Cases
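A minimal sketch of a Flume agent configuration that tails a log file on local disk and writes events to HDFS; the agent name, log file path, and HDFS path are placeholders:

    # agent1: exec source -> memory channel -> HDFS sink
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: tail a local log file
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app/access.log
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: write events to HDFS as plain text
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /user/training/flume/events
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel = ch1

Such an agent would be started with something like: flume-ng agent --conf conf --conf-file agent1.conf --name agent1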
Module 7: Data Transformation (Pig)
- Understand data types, data model, and modes of execution.
- Able to store the data from a Pig relation onto HDFS.
- Able to load data into a Pig relation with or without a schema.
- Able to split, join, filter, and transform the data using Pig operators.
- Able to write Pig scripts and work with UDFs (a short script sketch follows the topic list below).
- Introduction
- Pig Data Flow Engine
- Map-Reduce vs. Pig
- Data Types
- Basic Pig Programming
- Modes of execution in Pig
- Loading
- Storing
- Group
- Filter
- Join
- Order
- Flatten
- Cogroup
- Illustrate
- Explain
- Parameter substitution
- Creating simple UDFs in Pig
- Use Cases
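A short Pig Latin sketch touching several of the operators above; the file paths and the schema are hypothetical:

    -- Load web log records with an explicit schema
    logs = LOAD '/user/training/logs' USING PigStorage('\t')
           AS (ip:chararray, url:chararray, bytes:int);

    -- Keep only large responses
    big = FILTER logs BY bytes > 1024;

    -- Group by URL and count hits per URL
    by_url = GROUP big BY url;
    counts = FOREACH by_url GENERATE group AS url, COUNT(big) AS hits;

    -- Order descending and store the result back to HDFS
    top = ORDER counts BY hits DESC;
    STORE top INTO '/user/training/top_urls' USING PigStorage(',');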
Module 8: Hive & HCatalog
- Understand the importance of Hive and the Hive architecture
- Able to create Managed, External, Partitioned, and Bucketed tables
- Able to query the data and perform joins between tables (sample HiveQL follows the topic list below)
- Understand the storage formats of Hive
- Understand vectorization in Hive
- Introduction
- Hive vs. RDBMS
- HiveQL and Shell
- Data Types
- Schemas
- Hive Commands
- Hive Tables
- Managed Tables
- External Tables
- Loading
- Queries
- Inserting from other tables
- Partitions
- Loading into partitions
- Dynamic partitioning
- Bucketing
- Joins
- Views
- Sort By
- Distribute By
- HCatalog
- Using HCatStorer
- HCatLoader
- Use Cases
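Illustrative HiveQL for a partitioned external table, a load into one partition, and a simple join; the table names, column layout, and paths are placeholders, and the join assumes a separate customers table already exists:

    CREATE EXTERNAL TABLE sales (
      order_id    INT,
      customer_id INT,
      amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/training/sales';

    -- Load one partition from a local file
    LOAD DATA LOCAL INPATH '/tmp/sales_2024-01-01.csv'
    INTO TABLE sales PARTITION (order_date = '2024-01-01');

    -- Join against a customers table and aggregate
    SELECT c.name, SUM(s.amount) AS total
    FROM sales s JOIN customers c ON (s.customer_id = c.id)
    GROUP BY c.name;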
Module 9: Introduction to Other Eco-Systems
- Able to get started with other Eco-System components such as Hue, Oozie, and ZooKeeper (a minimal Oozie workflow sketch follows the topic list below)
- Hue
- Oozie
  - Architecture
  - Script
  - Use Cases
- ZooKeeper
  - Architecture
  - Use Cases
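To give a feel for the Oozie "Script" topic, here is a minimal sketch of a workflow that runs a single Pig script; the workflow name, action name, and script file are hypothetical, and ${jobTracker} and ${nameNode} would come from job.properties:

    <workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
      <start to="pig-node"/>
      <action name="pig-node">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>clean_logs.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Pig action failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>

It would be submitted with something like: oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run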