0

I have created a table to perform queries from S3 data(JSON format, which is the output from SageMaker) using Athena.

The file extension is: "filename.json.out". The formatting is in "JSON" format.

The files are coming from the SageMaker batch transform job. The table that I have been trying to create has the format like this:

enter image description here

the "col10" key is basically an array of string input like "col10": ["value1", "value2"]

Now, I've tried to create a table in a database like below:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
`data` array<struct<
    `col1`: string,
    `col2`: string,
    `col3`: string,
    `col4`: string,
    `col5`: string,
    `col6`: string,
    `col7`: string,
    `col8`: string,
    `col9`: string,
    `col10`: array,
    `col11`: string,
    `col12`: string,
    `col13`: string,
    `col14`: string,
    `col15`: string
>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket_name/object/'
TBLPROPERTIES ('has_encrypted_data'='false');

After creating the table, I tried to see the records by using the query below:

select * from db_name.table_name limit 10

Whenever I try to run the "select" query above, I find the error:

enter image description here

I have followed the links below:

Stackoverflow: How to Create Tables AWS Athena --> Mappings Json Array?

AWS: https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/

Stackoverflow: https://stackoverflow.com/a/50411984/9409770

And some others that I have forgotten. This problem is driving me crazy.

I have tried querying the same data format on a single file using "S3 Select", it worked without any problem. In Athena, I cannot make it work. I am sure I am making a silly mistake here. However, I cannot figure out where. I am not that expert in this field.

If there is any cheaper alternative to get data programmatically from S3 without incurring much cost, please share this.

Thank you.

  • 1
    each JSON object should be in single line, there should be no newlines inside: `{"data": {...}}` Hive does not support multiline JSON. See https://stackoverflow.com/a/51011818/2700344 and https://stackoverflow.com/a/68745699/2700344 – leftjoin Sep 09 '21 at 16:36
  • So, that means I can use the same data in single row? What about multiple records then? Should those also be in same line? Like: {“data”:[{some records}, {some more records}]}? – Mahmud Arfan Sep 09 '21 at 16:47
  • Each table record = single line in file. Each line is not a part of bigger JSON object, it contains one JSON object. Yes {“data”:[{some records}, {some more records}]} is a single json object and shold be in one line in file. Remove upper array from json if you want to store recors separately, and of course table DDL is what you want to extract from each JSON object. One row in a table = one line in the file – leftjoin Sep 09 '21 at 17:16
  • My goodness, then I have no other option.. thanks a lot for clarifying. – Mahmud Arfan Sep 09 '21 at 18:08
  • 1
    Text files are being processed row by row. first file is being splitted between different completely isolated mappers, each processing part of the file, knowing nothing about other parts/or mappers. think about it. What you did is single non-splittable json object: to get schema, all big object should be read. Also you are trying to create table with single column containing all records in single row. Even if it worked, explode of single row containing array of all records is not a good idea. All will be processed in single mapper, in single thread. – leftjoin Sep 09 '21 at 18:18
  • @leftjoin write your comments as an answer! I was going to write something to the same effect, but you've already answered. – Theo Sep 09 '21 at 19:02
  • I was not sure that it is a multiline JSON. Now it can be closed as duplicate: https://stackoverflow.com/a/68745699/2700344 – leftjoin Sep 09 '21 at 19:38
  • Read also : https://stackoverflow.com/a/67947366/2700344 – leftjoin Sep 09 '21 at 19:48

0 Answers0