Posts

Showing posts from 2014

Solr document update

Updating a CSV document in Solr:

    [user@localhost exampledocs]$ curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'

After the above command you will see a response like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">60</int></lst>
    </response>

Running it again:

    [user@localhost exampledocs]$ curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">61</int></lst>
    </response>
    [user@localhost exampledocs]$

A status of 0 means the update succeeded. I am also working on this, so I'll change/upd
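If you prefer to drive the same update from Java instead of curl, here is a minimal sketch using only the JDK's HttpURLConnection. It is an illustration, not part of the original post: it assumes Solr is running on localhost:8983 and that books.csv is in the working directory, and depending on your solrconfig a separate commit may still be needed before the documents become searchable.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Scanner;

    public class SolrCsvUpdate {
        public static void main(String[] args) throws Exception {
            // Read the same CSV file the curl command posts
            byte[] csv = Files.readAllBytes(Paths.get("books.csv"));

            // Same endpoint and content type as the curl example above
            URL url = new URL("http://localhost:8983/solr/update/csv");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
            try (OutputStream out = con.getOutputStream()) {
                out.write(csv);
            }

            // Print the XML response; status 0 in the responseHeader means success.
            // Note: a commit may still be required for the documents to become visible.
            System.out.println("HTTP status: " + con.getResponseCode());
            try (Scanner sc = new Scanner(con.getInputStream(), "UTF-8")) {
                sc.useDelimiter("\\A");
                System.out.println(sc.hasNext() ? sc.next() : "");
            }
        }
    }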

Apache Solr Installation

Solr Installation and New Core Configuration: Apache Solr achieves fast search responses because, instead of searching the text directly, it searches an index. This is like finding pages in a book that mention a word by scanning the index at the back of the book, as opposed to reading every word of every page. This type of index is called an inverted index, because it inverts a page-centric data structure (page -> words) into a keyword-centric data structure (word -> pages). Solr stores this index in a directory called index inside the data directory. Under the hood, Apache Solr is powered by Lucene, a powerful open-source full-text search library; the relationship between Solr and Lucene is like that between a car and its engine. In Solr, a Document is the unit of search and indexing. An index consists of one or more Documents, and a Document consists of one or more Fields. 1. Installation - Download so
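To make the page-centric vs. keyword-centric idea concrete, here is a toy Java sketch of an inverted index. It is only an illustration of the concept; Solr and Lucene build and store their index very differently.

    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.TreeSet;

    public class InvertedIndexSketch {
        public static void main(String[] args) {
            // Two toy "pages" (documents), identified by id
            Map<Integer, String> pages = Map.of(
                    1, "solr is powered by lucene",
                    2, "lucene is a search library");

            // Invert: word -> set of page ids containing that word
            Map<String, Set<Integer>> index = new TreeMap<>();
            for (Map.Entry<Integer, String> page : pages.entrySet()) {
                for (String word : page.getValue().split("\\s+")) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(page.getKey());
                }
            }

            // Lookup is now keyword-centric: which pages mention "lucene"?
            System.out.println(index.get("lucene"));   // prints [1, 2]
        }
    }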

Hadoop Interview Questions

For Hadoop interview questions, see: Hadoop Interview Questions and Answers by rohit kapa

How to install apache sqoop

Sqoop Installation: Sqoop is useful for importing/exporting SQL/RDBMS data files or tables into Hadoop HDFS or a NoSQL store.
Download Sqoop 1.4.2:
$ wget http://www.eu.apache.org/dist/sqoop/1.4.2/sqoop-1.4.2.bin__hadoop-1.0.0.tar.gz
Extract the Sqoop 1.4.2 tarball:
$ tar -zxvf sqoop-1.4.2.bin__hadoop-1.0.0.tar.gz
Update hadooprc.sh with export SQOOP_HOME:
$ cat hadooprc.sh
export SQOOP_HOME=/data/sqoop-1.4.2
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop-1.0.1
export HBASE_HOME=/data/ahbase-0.94.3
export PATH=$PIG_HOME/bin:$SQOOP_HOME/bin:$HIVE_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH:$HBASE_HOME/bin
export CLASSPATH=$JAVA_HOME:/data/hadoop-1.0.1/hadoop-core-1.0.1.jar:$PIG_HOME/pig-0.10.0.jar
Download the MySQL connector jar from http://dev.mysql.com/downloads/connector/j/, extract it, and copy mysql-connector-java-5.1.22-bin.jar into SQOOP_HOME/lib.
Install MySQL server:
$ sudo yum install mysql-server
$ sudo ser
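Sqoop reaches MySQL through that same connector jar, so a quick way to confirm the connector works is a plain JDBC check from Java. This is only a sketch: the database name, user, and password below are placeholders, and it assumes mysql-connector-java-5.1.22-bin.jar is on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class MysqlConnectorCheck {
        public static void main(String[] args) throws Exception {
            // Driver class shipped in mysql-connector-java 5.1.x
            Class.forName("com.mysql.jdbc.Driver");

            // Placeholder database and credentials; replace with the ones Sqoop will use
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/test", "dbuser", "dbpassword");
            System.out.println("Connected: " + !con.isClosed());
            con.close();
        }
    }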

How to install apache PIG

Installation of PIG:
1. Untar the file: tar -zxvf filename
2. Go inside and export JAVA_HOME and HADOOP_HOME:
export JAVA_HOME=/home/user/jdk1.7-45
export HADOOP_HOME=/home/user/hadoop-1.2.1
3. Start Pig: bin/pig
4. Run commands at the grunt prompt (the same script can also be embedded in Java, see the sketch after this list):
grunt> a = load '/first' as (id:int,name:chararray,city:chararray);
grunt> b = filter a by city == 'chennai';
grunt> c = foreach b generate name;
grunt> dump c;
grunt> store c into '/output';
grunt> b = foreach a generate id+1000;
grunt> dump b;
grunt> fs -copyFromLocal first /
grunt> a = load '/first' as (id:int,name:chararray,city:chararray);
grunt> b = foreach a generate city;
grunt> dump b;
grunt> ab = load '/infile.txt' using PigStorage(' ') as (first:chararray, last:chararray, age:int, dept:chararray);
grunt> b = FILTER ab BY last=='kumar';
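A minimal sketch of embedding the same script in a Java program via Pig's PigServer API, assuming the Pig jar is on the classpath and that '/first' exists with the schema used above:

    import java.util.Iterator;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class PigEmbeddedSketch {
        public static void main(String[] args) throws Exception {
            // "local" runs against the local file system; use "mapreduce" for a cluster
            PigServer pig = new PigServer("local");

            // Same statements as the grunt session above
            pig.registerQuery("a = load '/first' as (id:int,name:chararray,city:chararray);");
            pig.registerQuery("b = filter a by city == 'chennai';");
            pig.registerQuery("c = foreach b generate name;");

            // Equivalent of "dump c": iterate over the result tuples
            Iterator<Tuple> it = pig.openIterator("c");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
            pig.shutdown();
        }
    }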

Cassandra vs Hbase

Hbase Vs Cassandra: Historically, HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft), was initially part of Hadoop, and then became a Top-Level Project. Cassandra originated at Facebook in 2007, was open sourced, incubated at Apache, and is nowadays also a Top-Level Project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and elastic. There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone with read optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their

Hbase Program in eclipse

Hbase Program to work on an IDE like Eclipse:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.MasterNotRunningException;
    import org.apache.hadoop.hbase.ZooKeeperConnectionException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HbaseTest {
        private static Configuration conf = null;
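The excerpt stops at the class declaration. As a rough sketch of what a main method in such a class typically does with the HBase 0.9x client API imported above (not the post's exact code, and assuming HBase and ZooKeeper are reachable from the classpath's hbase-site.xml), it could create a table, put a row, and read it back:

    public static void main(String[] args) throws IOException {
        conf = HBaseConfiguration.create();          // picks up hbase-site.xml from the classpath

        // Create table "test" with one column family "cf" if it does not exist yet
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (!admin.tableExists("test")) {
            HTableDescriptor desc = new HTableDescriptor("test");
            desc.addFamily(new HColumnDescriptor("cf"));
            admin.createTable(desc);
        }

        // Put one row, then read it back with a Get
        HTable table = new HTable(conf, "test");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))));
        table.close();
    }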

Hadoop eco system tools

Hadoop, Hive, Hbase, Zookeeper: Hadoop is not a database engine. Hadoop is a combination of a file system (HDFS) and Java APIs to perform computation on HDFS data. Furthermore, MapReduce is not a technology; it is a model for working on HDFS data in parallel. HBase is a NoSQL datastore that runs on top of your existing Hadoop cluster (HDFS). It provides capabilities such as random, real-time reads/writes, which HDFS, being a file system, lacks. Since it is a NoSQL datastore it does not follow SQL conventions and terminology. HBase provides a good set of APIs (including Java and Thrift), and it also integrates seamlessly with the MapReduce framework. But along with all these advantages, keep in mind that random read/write is quick but always carries additional overhead, so think well before you make any decision. ZooKeeper is a high-performance coordination service for distributed applications (like HBase). It exposes common ser
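As an illustration of the "coordination service" part (this is not from the post, just a minimal sketch assuming a ZooKeeper server on localhost:2181), a Java client can create an ephemeral znode, which is the primitive that systems like HBase build locks, leader election, and liveness tracking on:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkCoordinationSketch {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);

            // Assumes a ZooKeeper server on localhost:2181
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();   // wait until the session is established

            // An ephemeral znode disappears automatically when this session dies
            String path = zk.create("/demo-lock", "owner-1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Created " + path);

            System.out.println(new String(zk.getData("/demo-lock", false, null)));
            zk.close();
        }
    }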

How to install apache hive | Hive Installation

Hive Installation:
1. Untar the file: tar -zxvf filename
2. Go inside and export JAVA_HOME and HADOOP_HOME:
export JAVA_HOME=/home/user/jdk1.7-45
export HADOOP_HOME=/home/user/hadoop-1.2.1
3. Start Hive: bin/hive
4. Try it out:
show databases;
show tables;
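Besides the hive shell, the same queries can be run from Java over JDBC. This is a sketch under assumptions not covered in the post: it assumes a HiveServer2 instance is running on its default port 10000 and that the Hive JDBC driver jars are on the classpath (older Hive versions use the org.apache.hadoop.hive.jdbc.HiveDriver class and a jdbc:hive:// URL instead).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveShowTables {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Equivalent of running "show tables;" at the hive prompt
            ResultSet rs = stmt.executeQuery("SHOW TABLES");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            con.close();
        }
    }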

Hadoop with Eclipse tool

Hadoop Configuration with the Eclipse IDE for development:
1. Go to the Eclipse plugins directory under your user dir, or /opt/eclipse/plugins/ if you installed Eclipse as root: cd /opt/eclipse/plugins/
2. Put the plugin there: cp hadoop-eclipse-plugin-1.0.4 /opt/eclipse/plugins/
3. Start Hadoop.
4. Start Eclipse (you may start it from a shell: type eclipse, then close the shell).
5. Right-click in the Map/Reduce Locations view beside the Console in Eclipse and click "New Hadoop location".
6. On the "Define Hadoop location" prompt, fill in:
   Location name = 111.111.1.111
   Map/Reduce Master: host = 111.111.1.111, port = 50310
   DFS Master: port = 50311
   Click Finish. Your HDFS is configured.
7. To refresh, right-click on the DFS Locations tab and click Refresh/Reconnect.
8. File -> New Project -> Map/Reduce Project -> project name -> (configure the Hadoop location where you installed it), then write a MapReduce/WordCount program.

Hadoop wordcount MapReduce program

WordCount Program in MapReduce

    package org.dkrajput.mr;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount2 {

      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (
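The excerpt cuts off inside the map loop. For reference, here is a sketch of how the standard Hadoop WordCount continues from that point (map loop body, a summing reducer, and the driver); it follows the stock example shipped with Hadoop rather than the post's exact code:

          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);     // emit (word, 1) for every token
          }
        }
      }

      public static class IntSumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);     // emit (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount2.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }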
Hadoop Installation: You can configure Hadoop on any operating system, but the real power of Hadoop comes with Linux.
1. Configure on Windows - using Cygwin, or using a virtual machine with Ubuntu installed. See: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
2. Configure on a Linux machine:
/home/Admin/
su - root
password: root
adduser dav
passwd dav
(now my user name and password are both set to dav)
su - dav
password: dav
(now I am logged in with my user)
pwd
/home/dav/
Step 1: Download Hadoop - use the latest stable release from the Hadoop site; I used hadoop-1.2.1.tar.gz, a tarball file.
Step 2: Install Java jdk1.6 or jdk1.7, or just put it here as it is.
Step 3: tar -zxvf hadoop-1.2.1.tar.gz
mkdir hdfstmp (it will create an hdfstmp dir here)
Go inside the hadoop directory and edit the following files:
cd hadoop-1.2.1
vi conf/hadoop-env.sh
Remove the # tag from JAVA_HOME=/home/
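Once the daemons are up, one way to verify the installation from Java is to list the HDFS root with the FileSystem API. This is only a sketch: the hdfs://localhost:9000 URI is an assumption and must match the fs.default.name you configure in core-site.xml, and the hadoop-core jar must be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode URI; adjust to your fs.default.name from core-site.xml
            conf.set("fs.default.name", "hdfs://localhost:9000");

            // List everything directly under the HDFS root
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }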
Hadoop architecture:
Input Data -> (Split) -> (Record Reader)
Mapper (tokenize into strings): [a,1] [a,1] [b,1], [a,1] [b,1]
Combiner (combine similar keys locally): [a,1,1] [b,1], [a,1] [b,1]
Partitioning (partition on the basis of key): [a,1,1], [b,1], [a,1], [b,1]
Shuffle and sort (the shuffle phase sorts the resulting pairs from the combiner phase, after which the data goes to the reducer): [a,1,1,1], [b,1,1]
Reducer: [a,3], [b,2]
For more info see https://developer.yahoo.com/hadoop/tutorial/module4.html
Index to learn the BigData Hadoop Framework: Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project built and used by a global community of contributors and users. It is licensed under the Apache License 2.0. The Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.
Hadoop MapReduce – a programming model for large-scale data processing.
Hadoop HDFS: What is HDFS (Hadoop Distributed File System)? HDFS Architecture, File System, Im
What is Big Data?
* Every day, the world creates 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
* Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
* According to IBM, 80% of the data captured today is unstructured: from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is also Big Data.
* Huge competition in the market:
   - Retail – customer analytics
   - Travel – travel patterns of the customer
   - Website – understanding users' navigation patterns, interest, conversion, etc.
   - Sensors, satellite, geospati