Posts

Showing posts from 2016

Add extra Hard Disk/Physical storage at Data Node

As we process data on a Hadoop cluster we need to configure many things. The main one is managing the temporary files that are generated between the map and reduce phases. We can enable map-reduce intermediate compression and map-reduce output compression, but even after setting everything we still need at least double/triple the space on the datanodes relative to the existing dataset on HDFS. So we can add capacity to a Hadoop cluster in 2 ways: 1. Add 1 or more datanodes 2. Add hard disks to the existing datanodes. How to add an extra HDD: First add a hard disk to every datanode machine and mount it on a mount point (example: /hdd2). Now create a datanode directory on the new hard disk (example: /hdd2/datanode), then change the ownership of /hdd2/datanode to the hdfs/hadoop user. Then stop one datanode and add the new hard disk location in hdfs-site.xml: <property> <name>dfs.datanode.data.dir</name> <value>/hdd/datanode,/hdd2/datanode</value> </property>
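A minimal sketch of the full sequence on one datanode, assuming the new disk shows up as /dev/sdb1, the mount point /hdd2 from above, an hdfs:hadoop user/group, and start/stop via hadoop-daemon.sh; adjust these names to your own cluster:

    # mount the new disk (device name /dev/sdb1 is an assumption)
    sudo mkdir -p /hdd2
    sudo mount /dev/sdb1 /hdd2

    # create the datanode directory on the new disk and hand it to the HDFS user
    sudo mkdir -p /hdd2/datanode
    sudo chown -R hdfs:hadoop /hdd2/datanode

    # stop the datanode before editing its config
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode

    # hdfs-site.xml on this datanode: list the old and the new directory, comma separated
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/hdd/datanode,/hdd2/datanode</value>
    </property>

    # start the datanode again so it picks up the new directory
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

Repeat the same steps on the remaining datanodes, one at a time, so the cluster stays up while you add the disks.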

IMPORT AN RDBMS TABLE INTO HDFS AS AN ORC FILE

Sqoop supports only a few file formats (text, sequence, Avro, etc.), so if you want to store RDBMS data in HDFS as ORC (a very compressed and fast file format, as Facebook has said and uses), you need to do this task in 2 steps: first import the RDBMS data as a text file, then insert that data into an ORC-formatted table. (NOTE: we can also do this using Spark.) Here I am explaining how to do it with a Sqoop batch. I am using cdh5.4.0-hadoop-2.6, cdh5.4.0-hive, apache-sqoop1.4.2. Hopefully you have everything installed; you can do this with Apache Hadoop and Hive as well, but it sometimes gives errors because of version dependencies. As far as I know, if you match the right Hadoop & Hive versions you will not get any errors; otherwise you may face many errors, since the Apache foundation is continuously improving every tool. If you are not sure, it is best to go with CDH; you can download the tar files and install them separately. http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.0.tar.gz http://archiv
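A hedged sketch of the two steps; the MySQL connection string, database salesdb, table customers, staging directory, and the three-column schema are all placeholders for illustration:

    # Step 1: Sqoop import of the RDBMS table into HDFS as plain text
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/salesdb \
      --username dbuser -P \
      --table customers \
      --target-dir /user/hive/staging/customers \
      --fields-terminated-by ',' \
      -m 4

    -- Step 2 (Hive): expose the text data, then insert it into an ORC table
    CREATE EXTERNAL TABLE customers_txt (id INT, name STRING, city STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/hive/staging/customers';

    CREATE TABLE customers_orc (id INT, name STRING, city STRING)
      STORED AS ORC;

    INSERT OVERWRITE TABLE customers_orc SELECT * FROM customers_txt;

Once the INSERT finishes, the staging text table (and its directory) can be dropped, leaving only the compact ORC copy.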

How to auto re-launch a YARN Application Master on a failure.

1) Use Case: The fundamental idea of Hadoop 2 (MapReduce + YARN) is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. The ResourceManager has two main components: 1. Scheduler: responsible for allocating resources to the various running applications. 2. ApplicationsManager
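For the re-launch behaviour the post title refers to, the usual knobs are the ResourceManager's global cap on ApplicationMaster attempts and, for MapReduce jobs, the per-job attempt count. A minimal sketch; the value 4 is just an example, and the per-job setting must not exceed the cluster-wide cap:

    <!-- yarn-site.xml: cluster-wide maximum number of ApplicationMaster attempts -->
    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>4</value>
    </property>

    <!-- mapred-site.xml: attempts allowed for a MapReduce job's ApplicationMaster -->
    <property>
      <name>mapreduce.am.max-attempts</name>
      <value>4</value>
    </property>

With these set, the ResourceManager will start a new AM attempt when a previous one fails, up to the configured limit.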