How to query data from gz file of Amazon S3 using Qubole Hive query？

Question

I need get specific data from gz. how to write the sql? can I just sql as table database?:

  Select * from gz_File_Name where key = 'keyname' limit 10.

but it always turn back with an error.

score 1 · Accepted Answer · edited May 23 '17 at 12:32

1

You need to create Hive external table over this file location(folder) to be able to query using Hive. Hive will recognize gzip format. Like this:

create external table hive_schema.your_table (
col_one string, 
col_two string
)
stored as textfile  --specify your file type, or use serde
LOCATION
  's3://your_s3_path_to_the_folder_where_the_file_is_located'
;

See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable

To be precise s3 under the hood does not store folders, filename containing /s in s3 represented by different tools such as Hive like a folder structure. See here: https://stackoverflow.com/a/42877381/2700344

edited May 23 '17 at 12:32

Community

1
1

answered Mar 22 '17 at 08:05

leftjoin

36,950
8
57
116

the new table should build in S3? any where? – daxue Mar 22 '17 at 09:22
I think you are working with S3 because you have tagged your question amazon-s3. You may use HDFS instead if you have such option. Hive table can be created with location in S3 or HDFS. You have to put your file into table location. Hive will scan all files inside the table location. You can use only s3 or HDFS for this purpose. – leftjoin Mar 22 '17 at 10:00
Thanks...I wonder if HDFS is a storage base just as S3 or mysql? – daxue Mar 23 '17 at 02:58
And another question is, as I create an external table using Qubole.Is that mean,the table is in S3, the data is export from a dir of s3 to another? and now I can do query with the table.just as opration on mysql?(I only used mysql+plsql or oracle +plsql) – daxue Mar 23 '17 at 03:01
http://docs.qubole.com/en/latest/user-guide/features/explore/s3-explore.html I tried to do follow this page,but I cannot find how to find this view.. – daxue Mar 23 '17 at 03:12
I read some doc but still confused. 's3://your_s3_path_to_the_folder_where_the_file_is_located' is the path of the gzip file? the statement only let qubole point to the file but can see it in a table style? – daxue Mar 23 '17 at 05:47
The path to the folder where your files are located and you can upload them using s3cmd. Table in Hive needs folder location. When you query table in Hive, it scans files in table location in S3. You can insert into table using Hive and files will be created in the folder specified. The statement is not related to Qubole. It's Hive command. Qubole provides you with web interface, rest service and SDKs for running different commands including Hive commands. Also qubole manages clusters for you. Hive data are located in s3. – leftjoin Mar 23 '17 at 09:16
You wrote: "I tried to do follow this page,but I cannot find how to find this view" Try to switch in upper left corner to Explore then in drop-down list below to My Amazon S3 then chose your bucket... – leftjoin Mar 23 '17 at 09:22
amazing....to make sure...If the gzip data is updated , can I find it in the external table rather than upload it by hand from the gzip to the table? – daxue Mar 23 '17 at 09:33
Yes, you can upload and query and you can overwrite using Hive HQL and files will be modified. You can load data using HQL into empty table and Hive will create file(s). Hive supports a lot of file types an few compression codecs. Gzip is recognized automatically when Hive reads table files. There are configuration parameters in Hive for controlling output file sizes, compression, etc. – leftjoin Mar 23 '17 at 09:46
If I do not upload the data from gzip , then it will not be found in the table? – daxue Mar 23 '17 at 09:55
If you mean that what if you upload not gzipped file, then the answer is yes, Hive will read not gzipped as well along with gzipped files if any. If you will not upload file at all, hive query will not return rows. You can load data from other table using `insert overwrite .. select`. – leftjoin Mar 23 '17 at 10:28
1

Do not confuse Hive and Qubole. Hive is a database that provides you hql client, metadata for working with tables. Tables in Hive are built over locations in S3. Qubole is an interface between you and AWS, wrapper that simplifies cluster management and optimizes costs, it starts shut down, shrinks and increases on demand the AWS cluster, starts and shuts Hive, etc. So you do not need to pay for not running AWS cluster or maintain it yourself. You can access to S3 files using Qubole interface, using AWS s3cmd commands and using Hive query language or other tools. – leftjoin Mar 23 '17 at 10:29
I think you didnot understand my question..I mean,If the gzip file is related to the external table? means,if the gzip data is added, so the table's data will been ''Automatically'' added , too? – daxue Mar 24 '17 at 01:49
Data are in files. Hive itself stores only metadata and statistics in it's metastore. External or managed, when you adding files and query table, Hive scans all files in the folder. When you drop external table the files remain where they are. – leftjoin Mar 24 '17 at 07:57

How to query data from gz file of Amazon S3 using Qubole Hive query？

1 Answers1