Posts

PrestoDB: An open source distributed SQL query engine

Presto DB: A Powerful Query Engine. Presto is an open source distributed query engine for running interactive SQL (analytic queries) on big data, from gigabytes to terabytes or petabytes. Presto was designed for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Presto allows querying data from Hadoop HDFS, Hive, Cassandra, relational databases, or even proprietary data stores, and a single Presto query can combine data from multiple sources. The main goal of Presto is to deliver analytic query results in sub-seconds to minutes on inexpensive hardware such as a Hadoop cluster. It is completely free. Facebook uses Presto for interactive queries against several internal data stores, including their 300 PB data warehouse. I personally tried Presto on a 3-node cluster with 1 TB to 3 TB of data residing on Hadoop HDFS and got awesome performance, with sub-second calculations ...
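As an illustration of combining sources in one statement, here is a hypothetical federated query run through the Presto CLI. The server address, catalog, schema, and table names are all made up for this sketch; substitute your own catalogs:

```
# query a Hive table and a MySQL table in a single Presto statement
# (localhost:8080, "orders", and "customers" are example names)
presto --server localhost:8080 --execute "
  SELECT o.order_id, c.name, o.amount
  FROM hive.default.orders o
  JOIN mysql.shop.customers c
    ON o.customer_id = c.id
  LIMIT 10"
```

The fully qualified `catalog.schema.table` names are what let one query span HDFS/Hive and a relational database.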

Add an extra Hard Disk/Physical storage at a DataNode

As we process data on a Hadoop cluster, we need to configure many things. The main one is managing the temporary files generated between map and reduce. We can enable compression of map-reduce intermediate data and of map-reduce output. Even after setting everything, we need at least double or triple the space on the DataNodes relative to the existing dataset on HDFS. So we can add storage to a Hadoop cluster in 2 ways: 1. Add 1 or more DataNodes 2. Add hard disks to the existing DataNodes. How to add an extra HDD: First, attach a hard disk to every DataNode machine and mount it at a mount point (example: /hdd2). Now create a datanode directory on the new hard disk (example: /hdd2/datanode). Now change the ownership of /hdd2/datanode to the hdfs/hadoop user, then stop one DataNode and add the new hard disk location in hdfs-site.xml <property> ...
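For reference, the hdfs-site.xml change usually looks like the sketch below: `dfs.datanode.data.dir` takes a comma-separated list of directories, so the new disk is appended after the existing one (the paths here are examples):

```xml
<!-- hdfs-site.xml: existing directory plus the newly mounted disk -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hdd1/datanode,/hdd2/datanode</value>
</property>
```

Restart that DataNode afterwards, then repeat node by node so the cluster stays available throughout.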

IMPORT AN RDBMS TABLE INTO HDFS AS AN ORC FILE

Sqoop supports only a few file formats (text, sequence, Avro, etc.). If you want to store RDBMS data in HDFS as ORC (a very compressed and fast file format, as Facebook says and uses), you need to do this task in 2 steps: first import the RDBMS data as a text file, and then insert that data into an ORC-formatted table. (NOTE: We can also do this using Spark.) Here I explain how to do it using a Sqoop batch job. I am using cdh5.4.0-hadoop-2.6, cdh5.4.0-hive, and apache-sqoop1.4.2. I hope you have everything installed. You can also do this with Apache Hadoop and Hive, but it sometimes gives errors because of version dependencies. As far as I know, if you can match the Hadoop and Hive versions perfectly you will not get any errors; otherwise you may face many errors, since the Apache Foundation is continuously improving every tool. If you are not sure, it is best to go with CDH; you can download the tar files and install them separately. http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.0.tar.gz http://archiv...
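The two-step flow can be sketched roughly as follows. The connection string, credentials, table names, columns, and HDFS paths are all hypothetical; adjust them for your own database and cluster:

```
# Step 1: import the RDBMS table into HDFS as delimited text
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hive/staging/orders \
  --fields-terminated-by ','

# Step 2: in Hive, expose the text data and copy it into an ORC table
hive -e "
  CREATE EXTERNAL TABLE orders_txt (id INT, amount DOUBLE, created STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/staging/orders';
  CREATE TABLE orders_orc (id INT, amount DOUBLE, created STRING)
    STORED AS ORC;
  INSERT OVERWRITE TABLE orders_orc SELECT * FROM orders_txt;"
```

Once the INSERT finishes, the staging text table (and its directory) can be dropped.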

How to auto re-launch a YARN ApplicationMaster on failure

1) Use Case: The fundamental idea of Hadoop 2 (MapReduce + YARN) is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. The ResourceManager has two main components: 1. Scheduler: responsible for allocating resources to the various running applications. 2. Appl...
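The knobs that control automatic re-launch of a failed AM are the attempt limits: a cluster-wide cap in yarn-site.xml and, for MapReduce jobs, a per-job setting. The values below are illustrative, not recommendations:

```xml
<!-- yarn-site.xml: global upper bound on AM attempts (default is 2) -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
</property>

<!-- mapred-site.xml: per-job AM attempts for MapReduce, capped by the global bound -->
<property>
  <name>mapreduce.am.max-attempts</name>
  <value>4</value>
</property>
```

When an AM fails, the RM will start a new attempt until this limit is reached, after which the whole application is marked failed.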

Run a Linux shell script every few minutes/hours/months ...

Q: How can I run a Linux shell script every n minutes/hours/months? A: By using a while loop in the script, or 'CRON'. 'CRON' is the best choice. Cron is a daemon found on most Unix/Linux systems that runs scheduled commands at specified intervals. You add a script to the list by copying it to the folder of your choice: cron.daily cron.hourly cron.monthly cron.weekly These folders are typically found in /etc. OR just type the command below on the console and an editor will open: $ crontab -e In it there is a line like # * * * * * command Remove the # from the beginning of the line and set it as you want: for every 10 minutes */10 * * * * ./script for every 2 hours 0 */2 * * * Here ./script may be a Linux command or a script. Using this we can run Hadoop/Hive/Pig scripts on any interval we want. =================xxxxxxxxxxxxxxxx============================ My Scripts ============================================================= script.sh file in home d...
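As a minimal sketch, a script.sh like the one below just appends a timestamp to a log file (the /tmp path is arbitrary); any Hadoop/Hive/Pig command could go in its place. A crontab line such as `*/10 * * * * /home/user/script.sh` would then run it every 10 minutes:

```shell
#!/bin/sh
# script.sh — append a timestamped marker on each run.
# Replace the echo with your real Hadoop/Hive/Pig command;
# the LOG path is just an example.
LOG=/tmp/script.log
echo "run at $(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
```

Logging each run this way also gives you a quick check that cron is actually firing on schedule.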

MongoDB Replication Configuration

MongoDB Replication Configuration: While setting up replication in MongoDB, if it shows an error like "not master", run the rs.slaveOk() command on the secondary node; that will solve the issue. ==================================================== Replication - MongoDB 1. Start by creating a data directory for each replica set member: mkdir /data/node1 mkdir /data/node2 mkdir /data/arbiter 2. Start a mongod for each member: mongod --replSet myapp --dbpath /data/node1 --port 40000 mongod --replSet myapp --dbpath /data/node2 --port 40001 mongod --replSet myapp --dbpath /data/arbiter --port 40002 3. Run mongo hostname:40000 to run the client on the primary, and then run the rs.initiate() command: > rs.initiate() { "info2" : "no configuration explicitly specified -- making one", "me" : "arete:40000", "info" : "Config now saved locally. Should come online in about a minute.", "ok" : 1 } 4. You can now add the o...
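Adding the remaining members is typically done from the primary's mongo shell with the replica set helper methods; the hostnames below are placeholders matching the ports started above:

```javascript
// run in the mongo shell connected to the primary (port 40000)
rs.add("hostname:40001")     // add the second data-bearing member
rs.addArb("hostname:40002")  // add the arbiter (votes in elections, holds no data)
rs.status()                  // verify member states: PRIMARY / SECONDARY / ARBITER
```

Once rs.status() shows one PRIMARY and the other members in their expected states, the replica set is up.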

What is MongoDB

MongoDB - Document Oriented NoSQL Database: MongoDB is one of several database types to arise in the mid-2000s under the NoSQL banner. Instead of using tables and rows as in relational databases, MongoDB is built on an architecture of collections and documents. Documents comprise sets of key-value pairs and are the basic unit of data in MongoDB. Collections contain sets of documents and function as the equivalent of relational database tables. Like other NoSQL databases, MongoDB supports dynamic schema design, allowing the documents in a collection to have different fields and structures. The database uses a document storage and data interchange format called BSON, which provides a binary representation of JSON-like documents. Automatic sharding enables data in a collection to be distributed across multiple systems for horizontal scalability as data volumes ...
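To illustrate dynamic schema, here are two documents that could sit in the same hypothetical users collection while carrying different fields — something a relational table would not allow without NULL-filled columns:

```json
[
  { "_id": 1, "name": "Asha", "email": "asha@example.com" },
  { "_id": 2, "name": "Ravi", "phone": "+1-555-0100", "tags": ["admin", "beta"] }
]
```

Each document is stored as BSON internally; the JSON shown here is just its human-readable form.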