Posts

Showing posts from 2017

Hive Partitioned Tables: Issue with Schema & PrestoDB

It often surprises Hive and PrestoDB users that the schema of a partitioned table in Hive is also defined at the partition level. The interaction between partition-level and table-level schemas in Apache Hadoop can get complex. Let's look at the details in the example below.

Table schema: In Hive you can change the schema of an existing table. Let's say you have a table:

CREATE TABLE test1 (
  id INT,
  name STRING,
  rating INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

We will focus on the third column, rating. We load a few records from HDFS into this table. The file looks like this:

1	john	3.5
2	Dav	4.6
3	andy	5

hive -e "load data inpath 'input.txt' into table test1"

The third column has some decimal values, but we have defined this column as an integer, so we won't see the decimal part in the data:

hive> select * from test1;
OK
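The schema change the post alludes to can be sketched as follows. This is a minimal illustration, assuming the test1 table above; the partition column dt is hypothetical, since the excerpt's table is not yet partitioned:

```sql
-- Change the table-level type of rating from INT to DECIMAL.
ALTER TABLE test1 CHANGE COLUMN rating rating DECIMAL(3,1);

-- For a partitioned table, existing partitions keep the schema they
-- were created with. Hive reads them using the table-level schema,
-- but PrestoDB also checks the partition-level schema, so each
-- partition may need the same change applied explicitly, e.g.:
-- ALTER TABLE test1 PARTITION (dt='2017-01-01')
--   CHANGE COLUMN rating rating DECIMAL(3,1);
```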

Installation of Presto DB & Client Connection with Presto

As we already discussed, Presto DB is a distributed analytical query engine for running SQL-style queries on a data warehouse. So let's see the installation of Presto DB.

Single-node Presto DB installation: here we install Presto DB on a single Linux machine: https://prestodb.io/docs/current/installation/deployment.html

Multi-node Presto DB installation: here we install Presto DB on three Linux machines; the same can be installed on an existing Hadoop cluster to run queries on Hive data: https://prestodb.io/docs/current/installation/deployment.html

Client connection with Presto: the Presto DB client can be downloaded from the Presto DB site: https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.161/presto-cli-0.161-executable.jar
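The client-connection step above can be sketched as a few shell commands. This assumes a coordinator already running; the address localhost:8080 and the hive/default catalog and schema are placeholders for your own deployment:

```shell
# Download the Presto CLI jar (URL from the post) and make it executable.
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.161/presto-cli-0.161-executable.jar -O presto
chmod +x presto

# Connect to the coordinator; --server, --catalog, and --schema
# point at your cluster (values here are assumptions).
./presto --server localhost:8080 --catalog hive --schema default
```

From the resulting prompt you can run standard SQL, e.g. show tables; or queries against Hive data.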

PrestoDB: An open source distributed SQL query engine

Presto DB: a powerful query engine. Presto DB is an open source distributed query engine for running interactive SQL (analytic queries) on big data ranging from gigabytes to terabytes or petabytes. Presto was designed for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Presto can query data from Hadoop HDFS, Hive, Cassandra, relational databases, and even proprietary data stores, and a single Presto query can combine data from multiple sources. The main goal of Presto is to deliver analytic query results in sub-seconds to minutes on inexpensive hardware such as a Hadoop cluster, and it is fully free. Facebook uses Presto for interactive queries against several internal data stores, including their 300 PB data warehouse. I personally tried Presto DB on a 3-node cluster with 1 TB to 3 TB of data residing on Hadoop HDFS and got awesome sub-second performance on calculations.
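The claim that a single Presto query can combine data from multiple sources can be illustrated with a federated join across two catalogs. The catalogs, schemas, and tables below (hive.web.page_views, mysql.crm.customers) are hypothetical examples, not real datasets:

```sql
-- Join HDFS-backed Hive data with a MySQL table in one query.
SELECT c.name, count(*) AS views
FROM hive.web.page_views v
JOIN mysql.crm.customers c
  ON v.customer_id = c.id
GROUP BY c.name
ORDER BY views DESC
LIMIT 10;
```

Each source is addressed as catalog.schema.table, so the same SQL works regardless of where the underlying data lives.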