IMPORT AN RDBMS TABLE INTO HDFS AS AN ORC FILE

Sqoop supports only a few file formats (text, sequence, Avro, etc.). If you want to store RDBMS data in HDFS as ORC (a highly compressed and fast file format, as Facebook has reported from its own use), you need to do this task in 2 steps: first import the RDBMS data as a text file, then insert that data into an ORC-formatted table. (NOTE: We can also do this using Spark.)

Here I am explaining how to do this using a Sqoop batch job. I am using cdh5.4.0-hadoop-2.6, cdh5.4.0-hive, and apache-sqoop1.4.2.

I hope you have everything installed. You can do this with plain Apache Hadoop and Hive as well, but it sometimes gives errors because of version dependencies. As far as I know, if you match compatible Hadoop and Hive versions you will not get any errors; otherwise you may face many, since the Apache foundation is continuously improving every tool.

If you are not sure, it is best to go with CDH; you can download the tar files and install them separately.
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.0.tar.gz
http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.4.0.tar.gz

Let's do some hands-on:

1. Create the first table, which will store data as a TEXT file

CREATE EXTERNAL TABLE MY_ITEM1 (MY_ITEM_ID decimal(10,0), MY_COMMENTS string, MY_USR_ID decimal(10,0), MY_ITEM_IS decimal(3,0), MY_ITEM_DATE timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hive/warehouse/my-db/my_item1/';

2. Create the second table, which will store data as an ORC file

CREATE EXTERNAL TABLE MY_ITEM2 (MY_ITEM_ID decimal(10,0), MY_COMMENTS string, MY_USR_ID decimal(10,0), MY_ITEM_IS decimal(3,0), MY_ITEM_DATE timestamp) STORED AS ORC LOCATION '/user/hive/warehouse/my-db/my_item2/';

3. Now run the Sqoop import into Hive's HDFS directory. Here I pass some extra parameters to Sqoop so that each field is enclosed in double quotes ("") and embedded delimiters don't mix up the data.

bin/sqoop-import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=hostname)(port=1521))(connect_data=(service_name=sip)))" --username username --password 'password' --table OracleTable1 --hive-table MY_ITEM1 --target-dir /user/hive/warehouse/my-db/my_item1/ --where "trunc(MY_ITEM_DATE)>='08-FEB-2016' AND trunc(MY_ITEM_DATE)<'09-FEB-2016'" --num-mappers 2 --fields-terminated-by ',' --lines-terminated-by '\n' --hive-drop-import-delims --escaped-by '\\' --enclosed-by '"' --append
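To see why enclosing fields in quotes keeps a comma inside a value from being mistaken for a field separator, here is a small local illustration using Python's csv module (not part of the Sqoop run; the sample row is made up):

```python
import csv
import io

# A row where one value itself contains a comma.
row = ["101", "good item, fast shipping", "42"]

# Without quoting, the comma inside the comment splits the field:
# a naive split sees four fields instead of three.
naive_line = ",".join(row)
naive_parsed = naive_line.split(",")
assert len(naive_parsed) == 4  # one field too many

# With every field enclosed in double quotes (the effect of
# --enclosed-by '"'), a CSV reader recovers the original fields.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
quoted_line = buf.getvalue()
parsed = next(csv.reader(io.StringIO(quoted_line)))
assert parsed == row
print(quoted_line.strip())
```

This prints `"101","good item, fast shipping","42"`, which parses back into exactly three fields.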

4. Now insert the data from the text table into the ORC-formatted table.

hive>insert into table my_item2 select * from my_item1;

5. After that, delete the first (text-formatted) table's data. Note that since MY_ITEM1 is an external table, removing its HDFS directory leaves the table metadata behind, so also drop the table in Hive if you no longer need it.

hive> dfs -rmr /user/hive/warehouse/my-db/my_item1;

That's it! :-)

I created the same table in the 2 formats (ORC and TEXT file format) and got some amazing stats:

ORC FILE     150 MB    Group-by computation read time: 30 seconds
TEXT FILE    700 MB    Group-by computation read time: 42 seconds
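Much of that gap comes from I/O: a group-by typically touches only one or two columns, and a columnar format like ORC reads just those columns' bytes (plus compression), while a row-oriented text file must be scanned in full. A toy, stdlib-only sketch of the idea (the sample data and column names are made up):

```python
from collections import Counter

# Rows of (item_id, status, date), mimicking a small fact table.
rows = [(i, "ACTIVE" if i % 3 else "CLOSED", "2016-02-08") for i in range(10000)]

# Row-oriented text layout: a group-by on one column still scans every byte.
row_layout = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column-oriented layout: each column is stored contiguously, so a
# group-by on `status` touches only that one column's bytes.
columns = {name: [r[i] for r in rows]
           for i, name in enumerate(["item_id", "status", "date"])}
status_bytes = "\n".join(columns["status"]).encode()

print(len(row_layout), len(status_bytes))
assert len(status_bytes) < len(row_layout)

# The group-by itself, computed over just that one column.
counts = Counter(columns["status"])
assert counts["ACTIVE"] + counts["CLOSED"] == 10000
```

Here the status column is a small fraction of the total row bytes, which is the same reason the ORC table's group-by finishes faster above.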
