Questions tagged [impala]

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Introduction from the whitepaper Impala: A Modern, Open-Source SQL Engine for Hadoop:

INTRODUCTION

Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

...

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.

References

Kornacker et al., "Impala: A Modern, Open-Source SQL Engine for Hadoop," CIDR 2015.

2083 questions
58
votes
5 answers

How does Impala provide faster query response compared to Hive?

I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared to Hive for the queries I have used so far. I am wondering if there are…
techuser soma
  • 4,766
  • 5
  • 23
  • 43
43
votes
2 answers

Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

I want to do some "near real-time" data analysis (OLAP-like) on data in HDFS. My research showed that the three mentioned frameworks report significant performance gains compared to Apache Hive. Does anyone have some practical experience with…
user2306380
  • 611
  • 1
  • 7
  • 10
26
votes
4 answers

How to copy all Hive tables from one database to another database

I have the default database in Hive, which contains 80 tables. I have created one more database and I want to copy all the tables from the default DB to the new database. Is there any way I can copy from one DB to the other DB, without creating individual…
Amaresh
  • 3,231
  • 7
  • 37
  • 60
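
A minimal per-table sketch for the question above (database and table names are hypothetical; in practice the 80 tables would be scripted over the output of SHOW TABLES):

    -- Works in both Hive and Impala: copy the table definition, then the data.
    CREATE TABLE newdb.my_table LIKE default.my_table;
    INSERT INTO newdb.my_table SELECT * FROM default.my_table;
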
21
votes
3 answers

Impala can't access all Hive tables

I am trying to query HBase data through Hive (I'm using Cloudera). I created a few Hive external tables pointing to HBase, but the problem is that Cloudera's Impala doesn't have access to all those tables. All Hive external tables appear in the metastore manager…
Nosk
  • 753
  • 2
  • 6
  • 24
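
A hedged sketch of the usual shape of the setup in the question above: the HBase-backed tables are defined in Hive, so Impala only sees them after its metadata cache is reloaded (all names below are hypothetical):

    -- In Hive: an external table mapped onto an HBase table.
    CREATE EXTERNAL TABLE hbase_events (key STRING, payload STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
    TBLPROPERTIES ("hbase.table.name" = "events");

    -- In impala-shell: make the newly created Hive table visible to Impala.
    INVALIDATE METADATA;
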
16
votes
3 answers

Difference between invalidate metadata and refresh commands in Impala?

I saw this at a link that applies to Impala version 1.1: Since Impala 1.1, the REFRESH statement only works for existing tables. For new tables you need to issue an "INVALIDATE METADATA" statement. Does this still hold true for later versions of Impala?
covfefe
  • 2,485
  • 8
  • 47
  • 77
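
For the question above, a short sketch of how the two statements are typically issued (table names are hypothetical). The split described for Impala 1.1 still holds in broad terms: REFRESH assumes Impala already knows the table, while INVALIDATE METADATA discards and reloads the cached metadata.

    -- Table created or loaded outside Impala (e.g. through Hive): Impala has no
    -- cached metadata for it yet, so the cache must be invalidated and rebuilt.
    INVALIDATE METADATA mydb.new_table;

    -- New data files added to a table Impala already knows about: a lighter
    -- reload of the file and block metadata is enough.
    REFRESH mydb.existing_table;
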
14
votes
2 answers

How to efficiently move data from Kafka to an Impala table?

Here are the steps to the current process: Flafka writes logs to a 'landing zone' on HDFS. A job, scheduled by Oozie, copies complete files from the landing zone to a staging area. The staging data is 'schema-ified' by a Hive table that uses the…
Alex Woolford
  • 4,433
  • 11
  • 47
  • 80
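
A sketch of the staging-to-Impala step that the pipeline in the question above describes, as it is commonly expressed in Impala SQL (table names, columns, and the HDFS path are hypothetical, not taken from the question):

    -- External text table over the files the Oozie job copies into the staging area.
    CREATE EXTERNAL TABLE IF NOT EXISTS logs_staging (ts STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/etl/staging/logs';

    -- Files written outside Impala stay invisible until the table is refreshed.
    REFRESH logs_staging;

    -- Compact the staged rows into a Parquet table for interactive queries.
    CREATE TABLE IF NOT EXISTS logs_parquet (ts STRING, msg STRING) STORED AS PARQUET;
    INSERT INTO logs_parquet SELECT ts, msg FROM logs_staging;
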
13
votes
1 answer

How to calculate seconds between two timestamps in Impala?

I do not see an Impala function to subtract two datestamps and return seconds (or minutes) between the two. http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_datetime_functions.html
ADJ
  • 4,892
  • 10
  • 50
  • 83
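
A minimal sketch for the question above using unix_timestamp(), which converts a TIMESTAMP to seconds since the epoch (column and table names are hypothetical):

    -- Difference in whole seconds; divide by 60 for minutes.
    SELECT unix_timestamp(end_ts) - unix_timestamp(start_ts)        AS diff_seconds,
           (unix_timestamp(end_ts) - unix_timestamp(start_ts)) / 60 AS diff_minutes
    FROM events;
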
12
votes
1 answer

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files into HDFS is Spark Structured Streaming (2.3.1). Right…
Victor
  • 2,450
  • 2
  • 23
  • 54
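
The question above reduces to the same metadata-staleness issue as the REFRESH/INVALIDATE METADATA entry earlier: files written by Spark are invisible to Impala until a refresh is issued. A minimal sketch (names are hypothetical; the partition-level form appeared in Impala releases later than the 2.1 mentioned, so treat it as an assumption):

    -- After new files land in the table's HDFS directory:
    REFRESH mydb.streaming_table;

    -- Newer Impala releases can limit the reload to a single partition:
    REFRESH mydb.streaming_table PARTITION (day='2019-01-01');
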
12
votes
6 answers

RODBC ERROR: Could not SQLExecDirect in mysql

I have been trying to write an R script to query an Impala database. Here is the query to the database: select columnA, max(columnB) from databaseA.tableA where columnC in (select distinct(columnC) from databaseB.tableB) group by columnA order by…
Gowtham Ganesh
  • 340
  • 1
  • 2
  • 12
11
votes
3 answers

How does computing table stats in hive or impala speed up queries in Spark SQL?

For increasing performance (e.g. for joins) it is recommended to compute table statistics first. In Hive I can do: analyze table <table> compute statistics; In Impala: compute stats <table>; Does my Spark application (reading from…
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
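
For reference on the question above, the two statements it mentions written out in full (table name hypothetical); whether Spark SQL actually uses the resulting metastore statistics is exactly what the question asks and depends on the Spark version:

    -- Hive (beeline / Hive CLI): table-level statistics.
    ANALYZE TABLE mydb.my_table COMPUTE STATISTICS;

    -- Impala (impala-shell): table- and column-level statistics.
    COMPUTE STATS mydb.my_table;
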
11
votes
2 answers

Big data signal analysis: better way to store and query signal data

I am about to do some signal analysis with Hadoop/Spark and I need help on how to structure the whole process. Signals are currently stored in a database, which we will read with Sqoop and transform into files on HDFS, with a schema similar…
Ameba Spugnosa
  • 1,204
  • 2
  • 11
  • 25
11
votes
1 answer

Can Informatica Big Data Edition ETL (not the cloud version) connect to Cloudera Impala?

We are trying to do a proof of concept on Informatica Big Data Edition (not the cloud version), and I have seen that we might be able to use HDFS and Hive as source and target. But my question is, does Informatica connect to Cloudera Impala? If so, do we…
sun_dare
  • 1,146
  • 2
  • 13
  • 33
11
votes
3 answers

Convert YYYYMMDD String to Date in Impala

I'm using SQL in Impala to write this query. I'm trying to convert a date string, stored in YYYYMMDD format, into a date format for the purposes of running a query like this: SELECT datadate, session_info FROM database WHERE datadate >=…
nxl4
  • 714
  • 2
  • 8
  • 17
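
A hedged sketch for the question above, assuming Impala 2.3 or later where to_timestamp(STRING, pattern) is available (the table name and the literal date are hypothetical, not taken from the question):

    SELECT datadate, session_info
    FROM my_database.my_table
    WHERE to_timestamp(datadate, 'yyyyMMdd') >= to_timestamp('20180101', 'yyyyMMdd');
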
11
votes
2 answers

Write pandas table to Impala

Using the impyla module, I've downloaded the results of an Impala query into a pandas dataframe, done my analysis, and would now like to write the results back to a table in Impala, or at least to an HDFS file. However, I cannot find any information…
SummerEla
  • 1,902
  • 3
  • 26
  • 43
9
votes
2 answers

Is there a way to turn off DESCRIBE in R dplyr SQL?

I'm using R Shiny and dplyr to connect to a database and query the data in Impala. I do the following: con <- dbPool(odbc(), Driver = [DRIVER], Host = [HOST], Schema = [SCHEMA], Port = [PORT], UID = [USERNAME], PWD = [PASSWORD]) table_foo <-…
bink1time
  • 383
  • 1
  • 5
  • 15