Hadoop ecosystem tools

Hadoop, Hive, HBase, ZooKeeper:


Hadoop is not a database engine. It is a combination of a distributed file system (HDFS) and Java APIs for running computations on the data stored in HDFS.
Furthermore, MapReduce is not a technology in itself; it is a programming model that lets you process HDFS data in parallel.
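
To make the "model, not technology" point concrete, here is a minimal sketch of the MapReduce idea in plain Java, without any Hadoop classes. The class name WordCountSketch and its methods are made up for illustration and are not part of the Hadoop API. It only shows the three phases (map, shuffle, reduce) on an in-memory list of lines; on a real cluster the map calls would run in parallel on different blocks of an HDFS file.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Hadoop.
public class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single result.
    public static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs the three phases one after another on in-memory data.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // each line is independent, so this step could run in parallel
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop hive hbase", "hbase zookeeper hbase");
        // prints {hadoop=1, hbase=3, hive=1, zookeeper=1}
        System.out.println(wordCount(lines));
    }
}
```

The map and reduce functions never look at more than one line or one key at a time, which is what makes the model easy to spread over many machines.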

HBase is a NoSQL datastore that runs on top of your existing Hadoop cluster (HDFS). It gives you capabilities such as random, real-time reads and writes, which HDFS, being a file system, lacks. Since it is a NoSQL datastore, it doesn't follow SQL conventions and terminology. HBase provides a good set of APIs (including Java and Thrift) and also integrates seamlessly with the MapReduce framework. But along with all these advantages, you should keep in mind that random read-write is quick but always carries additional overhead. So think carefully before you make any decision.

ZooKeeper is a high-performance coordination service for distributed applications (like HBase). It exposes common services such as naming, configuration management, synchronization, and group services through a simple interface, so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols, and you can build on it for your own specific needs.

HBase relies completely on ZooKeeper. HBase gives you the option to use its built-in ZooKeeper, which is started whenever you start HBase. That is not a good idea on a production cluster, though. In such scenarios it is better to have a dedicated ZooKeeper cluster and integrate it with your HBase cluster.

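Note: You should always have an odd number of nodes in your ZK quorum. ZooKeeper needs a majority of its nodes to be up, so an even-sized ensemble tolerates no more failures than the odd-sized one just below it.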
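
The arithmetic behind the odd-number rule can be sketched in a few lines of Java. The class and method names below are hypothetical and have nothing to do with the ZooKeeper client API. The only assumption is the well-known one that the ensemble keeps working as long as a strict majority of its nodes is reachable.

```java
// Hypothetical helper for illustration only, not part of ZooKeeper.
public class QuorumSketch {

    // Smallest number of nodes that forms a strict majority of the ensemble.
    public static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of nodes that can fail while a majority is still available.
    public static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.printf("%d nodes: majority=%d, tolerates %d failure(s)%n",
                    n, majority(n), tolerableFailures(n));
        }
    }
}
```

With 3 nodes you survive 1 failure, and with 4 nodes you still survive only 1. The same holds for 5 and 6 nodes, which both survive 2. The extra even node adds cost without adding fault tolerance.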

--------------================----------------=============--------------=================--------------------
When we talk about installing HBase in fully distributed mode, we are dealing with the following components:

HDFS: A running instance of HDFS is required for deploying HBase in distributed mode.

HBase Master: An HBase cluster has a master-slave architecture in which the HBase Master is responsible for monitoring all the slaves, i.e. the Region Servers.

Region Servers: These are the slave nodes responsible for storing and managing regions.

ZooKeeper Cluster: A distributed Apache HBase installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble.
--------------============-------------=============-------------================--------------------------
Hive is a data warehouse for Hadoop with an SQL interface, and Pig is a high-level language for ad-hoc analysis.
