Questions tagged [impala]

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Introduction from the whitepaper Impala: A Modern, Open-Source SQL Engine for Hadoop:

INTRODUCTION

Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

...

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.

References

Kornacker et al., "Impala: A Modern, Open-Source SQL Engine for Hadoop," CIDR 2015.

2083 questions
58
votes
5 answers

How does Impala provide faster query response compared to Hive?

I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared to Hive for the queries I have used so far. I am wondering if there are…
techuser soma
  • 4,766
  • 5
  • 23
  • 43
43
votes
2 answers

Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

I want to do some "near real-time" data analysis (OLAP-like) on data in HDFS. My research showed that the three mentioned frameworks report significant performance gains compared to Apache Hive. Does anyone have some practical experience with…
user2306380
  • 611
  • 1
  • 7
  • 10
26
votes
4 answers

How to copy all Hive tables from one database to another database

I have the default database in Hive, which contains 80 tables. I have created one more database and I want to copy all the tables from the default DB to the new database. Is there any way I can copy from one DB to the other DB, without creating individual…
Amaresh
  • 3,231
  • 7
  • 37
  • 60
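
A minimal per-table sketch for the question above (database and table names are hypothetical; in practice the 80 tables would be scripted over the output of SHOW TABLES):

    -- Works in both Hive and Impala: copy the table definition, then the data.
    CREATE TABLE newdb.my_table LIKE default.my_table;
    INSERT INTO newdb.my_table SELECT * FROM default.my_table;
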
21
votes
3 answers

Impala can't access all Hive tables

I am trying to query HBase data through Hive (I'm using Cloudera). I created a few Hive external tables pointing to HBase, but the problem is that Cloudera's Impala doesn't have access to all those tables. All Hive external tables appear in the metastore manager…
Nosk
  • 753
  • 2
  • 6
  • 24
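
A hedged sketch of the usual shape of the setup in the question above: the HBase-backed tables are defined in Hive, so Impala only sees them after its metadata cache is reloaded (all names below are hypothetical):

    -- In Hive: an external table mapped onto an HBase table.
    CREATE EXTERNAL TABLE hbase_events (key STRING, payload STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
    TBLPROPERTIES ("hbase.table.name" = "events");

    -- In impala-shell: make the newly created Hive table visible to Impala.
    INVALIDATE METADATA;
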
16
votes
3 answers

Difference between invalidate metadata and refresh commands in Impala?

I saw this at a link that applies to Impala version 1.1: Since Impala 1.1, the REFRESH statement only works for existing tables. For new tables you need to issue an "INVALIDATE METADATA" statement. Does this still hold true for later versions of Impala?
covfefe
  • 2,485
  • 8
  • 47
  • 77
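
For the question above, a short sketch of how the two statements are typically issued (table names are hypothetical). The split described for Impala 1.1 still holds in broad terms: REFRESH assumes Impala already knows the table, while INVALIDATE METADATA discards and reloads the cached metadata.

    -- Table created or loaded outside Impala (e.g. through Hive): Impala has no
    -- cached metadata for it yet, so the cache must be invalidated and rebuilt.
    INVALIDATE METADATA mydb.new_table;

    -- New data files added to a table Impala already knows about: a lighter
    -- reload of the file and block metadata is enough.
    REFRESH mydb.existing_table;
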
14
votes
2 answers

How to efficiently move data from Kafka to an Impala table?

Here are the steps to the current process: Flafka writes logs to a 'landing zone' on HDFS. A job, scheduled by Oozie, copies complete files from the landing zone to a staging area. The staging data is 'schema-ified' by a Hive table that uses the…
Alex Woolford
  • 4,433
  • 11
  • 47
  • 80
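
A sketch of the staging-to-Impala step that the pipeline in the question above describes, as it is commonly expressed in Impala SQL (table names, columns, and the HDFS path are hypothetical, not taken from the question):

    -- External text table over the files the Oozie job copies into the staging area.
    CREATE EXTERNAL TABLE IF NOT EXISTS logs_staging (ts STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/etl/staging/logs';

    -- Files written outside Impala stay invisible until the table is refreshed.
    REFRESH logs_staging;

    -- Compact the staged rows into a Parquet table for interactive queries.
    CREATE TABLE IF NOT EXISTS logs_parquet (ts STRING, msg STRING) STORED AS PARQUET;
    INSERT INTO logs_parquet SELECT ts, msg FROM logs_staging;
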
13
votes
1 answer

How to calculate seconds between two timestamps in Impala?

I do not see an Impala function to subtract two datestamps and return seconds (or minutes) between the two. http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_datetime_functions.html
ADJ
  • 4,892
  • 10
  • 50
  • 83
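
A minimal sketch for the question above using unix_timestamp(), which converts a TIMESTAMP to seconds since the epoch (column and table names are hypothetical):

    -- Difference in whole seconds; divide by 60 for minutes.
    SELECT unix_timestamp(end_ts) - unix_timestamp(start_ts)        AS diff_seconds,
           (unix_timestamp(end_ts) - unix_timestamp(start_ts)) / 60 AS diff_minutes
    FROM events;
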
12
votes
1 answer

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files into HDFS is Spark Structured Streaming (2.3.1). Right…
Victor
  • 2,450
  • 2
  • 23
  • 54
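
The question above reduces to the same metadata-staleness issue as the REFRESH/INVALIDATE METADATA entry earlier: files written by Spark are invisible to Impala until a refresh is issued. A minimal sketch (names are hypothetical; the partition-level form appeared in Impala releases later than the 2.1 mentioned, so treat it as an assumption):

    -- After new files land in the table's HDFS directory:
    REFRESH mydb.streaming_table;

    -- Newer Impala releases can limit the reload to a single partition:
    REFRESH mydb.streaming_table PARTITION (day='2019-01-01');
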
12
votes
6 answers

RODBC ERROR: Could not SQLExecDirect in mysql

I have been trying to write an R script to query an Impala database. Here is the query to the database: select columnA, max(columnB) from databaseA.tableA where columnC in (select distinct(columnC) from databaseB.tableB) group by columnA order by…
Gowtham Ganesh
  • 340
  • 1
  • 2
  • 12
11
votes
3 answers

How does computing table stats in hive or impala speed up queries in Spark SQL?

For increasing performance (e.g. for joins) it is recommended to compute table statistics first. In Hive I can do: analyze table <table> compute statistics; In Impala: compute stats <table>; Does my Spark application (reading from…
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
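
For reference on the question above, the two statements it mentions written out in full (table name hypothetical); whether Spark SQL actually uses the resulting metastore statistics is exactly what the question asks and depends on the Spark version:

    -- Hive (beeline / Hive CLI): table-level statistics.
    ANALYZE TABLE mydb.my_table COMPUTE STATISTICS;

    -- Impala (impala-shell): table- and column-level statistics.
    COMPUTE STATS mydb.my_table;
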
11
votes
2 answers

Big data signal analysis: better way to store and query signal data

I am about to do some signal analysis with Hadoop/Spark and I need help on how to structure the whole process. Signals are currently stored in a database, which we will read with Sqoop and transform into files on HDFS, with a schema similar…
Ameba Spugnosa
  • 1,204
  • 2
  • 11
  • 25
11
votes
1 answer

Can Informatica Big Data Edition ETL (not the cloud version) connect to Cloudera Impala?

We are trying to do a proof of concept on Informatica Big Data Edition (not the cloud version), and I have seen that we might be able to use HDFS and Hive as source and target. But my question is, does Informatica connect to Cloudera Impala? If so, do we…
sun_dare
  • 1,146
  • 2
  • 13
  • 33
11
votes
3 answers

Convert YYYYMMDD String to Date in Impala

I'm using SQL in Impala to write this query. I'm trying to convert a date string, stored in YYYYMMDD format, into a date format for the purposes of running a query like this: SELECT datadate, session_info FROM database WHERE datadate >=…
nxl4
  • 714
  • 2
  • 8
  • 17
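
A hedged sketch for the question above, assuming Impala 2.3 or later where to_timestamp(STRING, pattern) is available (the table name and the literal date are hypothetical, not taken from the question):

    SELECT datadate, session_info
    FROM my_database.my_table
    WHERE to_timestamp(datadate, 'yyyyMMdd') >= to_timestamp('20180101', 'yyyyMMdd');
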
11
votes
2 answers

Write pandas table to Impala

Using the impyla module, I've downloaded the results of an Impala query into a pandas dataframe, done my analysis, and would now like to write the results back to a table in Impala, or at least to an HDFS file. However, I cannot find any information…
SummerEla
  • 1,902
  • 3
  • 26
  • 43
9
votes
2 answers

Is there a way to turn off DESCRIBE in R dplyr SQL?

I'm using R Shiny and dplyr to connect to a database and query the data in Impala. I do the following: con <- dbPool(odbc(), Driver = [DRIVER], Host = [HOST], Schema = [SCHEMA], Port = [PORT], UID = [USERNAME], PWD = [PASSWORD]) table_foo <-…
bink1time
  • 383
  • 1
  • 5
  • 15