14

Is there a way to create a table in Amazon Athena directly from parquet file based on avro schema? The schema is encoded into the file so its seems stupid that I need to actually create the DDL myself.

I saw this and also another duplication

but they are related directly to Hive, it wont work for Athena. Ideally I am looking for a way to do it programmatically without the need to define it at the console.

Community
  • 1
  • 1
NetanelRabinowitz
  • 1,534
  • 2
  • 14
  • 26
  • It would be more productive to try it (or at least read the documentation - http://docs.aws.amazon.com/athena/latest/ug/creating-tables.html) than to decide that it won't work. – David דודו Markovitz Mar 29 '17 at 17:43
  • 1
    read and tried, Athena only support CREATE EXTERNAL TABLE, the suggested solution is based on CREATE TABLE for the avro schema and only than create external one for the parquet – NetanelRabinowitz Mar 29 '17 at 22:15
  • So create an external table. Why should it matter? – David דודו Markovitz Mar 29 '17 at 22:40
  • 2
    I don't know thats exactly the reason I post this question.. Tried to change it to an external table but it failed to create it with this kind of syntax. If you have a working an example on Athena you are more than welcome to share :) – NetanelRabinowitz Mar 30 '17 at 17:34
  • Unfortunately I have access to Presto but not to Athena, otherwise I would supply an answer way before... :-) Can you share the error you got when you tried using external table? – David דודו Markovitz Mar 31 '17 at 10:48

2 Answers2

7

This is now more-or-less possible using AWS Glue. Glue can crawl a bunch of different data sources, including Parquet files on S3. Discovered tables are added to the Glue data catalog and queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.

If you have many separate hunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is MSCK REPAIR TABLE.

Steve McKay
  • 2,123
  • 17
  • 26
  • 1
    Glue crawls the data source. Want create table only. Athena does not need glue just schema copy/paste etc which is already defined in the parquet. Product managers at all? – mathtick Sep 10 '19 at 09:54
2

It doesn't seem to be possible with Athena as avro.schema.url is not a supported property.

table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)

You can use avro.schema.literal (you would have to copy the avro json schema to the query) but I still experienced problems querying the data afterwards.

Strange errors like: SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause

andresp
  • 1,624
  • 19
  • 31
  • 1
    The "strange error" you mentioned seems to be a bug in presto: https://github.com/prestodb/presto/issues/8654 – fiorix Oct 16 '17 at 23:03