0

i'm not able to query S3 files with Aws Athena, the content of the files are regular json arrays like this:

[
  {
    "DataInvio": "2020-02-06T13:37:00+00:00",
    "DataLettura": "2020-02-06T13:35:50+00:00",
    "FlagDownloaded": 0,
    "GUID": "f257c9c0-b7e1-4663-8d6d-97e652b27c10",
    "IMEI": "866100000062167",
    "Id": 0,
    "IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
    "IdTagLocal": 0,
    "SerialNumber": "142707160028BJZZZZ",
    "Tag": "E200001697080089188056D2",
    "Tipo": "B",
    "TipoEvento": "L",
    "TipoSegnalazione": 0,
    "TipoTag": "C",
    "UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
  },
  {
    "DataInvio": "2020-02-06T13:37:00+00:00",
    "DataLettura": "2020-02-06T13:35:50+00:00",
    "FlagDownloaded": 0,
    "GUID": "e531272e-465c-4294-950d-95a683ff8e3b",
    "IMEI": "866100000062167",
    "Id": 0,
    "IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
    "IdTagLocal": 0,
    "SerialNumber": "142707160028BJZZZZ",
    "Tag": "E200341201321E0000A946D2",
    "Tipo": "B",
    "TipoEvento": "L",
    "TipoSegnalazione": 0,
    "TipoTag": "C",
    "UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
  }
]

a simple query select * from mytable returns empty rows if the table has been generated in this way

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  `IdSessione` string,
  `DataLettura` date,
  `GUID` string,
  `DataInvio` date 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')

or it gives me an error HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Missing value at 1 [character 2 line 1] if the table has been generated with:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable(
  `IdSessione` string,
  `DataLettura` date,
  `GUID` string,
  `DataInvio` date 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')

if i modify the content of the file in this way (an json object each rows without trailing commas, the query gives me results)

{ "DataInvio": "2020-02-06T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
{ "DataInvio": "2020-02-07T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
    

How to query json array structures directly?

pinale
  • 2,060
  • 6
  • 38
  • 72

2 Answers2

1

Athena Best Practices recommends to have one json per row:

Make sure that each JSON-encoded record is represented on a separate line.

This has been asked a few times and I don't think someone made it work with a array of json:

Philipp Johannis
  • 2,718
  • 1
  • 15
  • 18
0

This is related to the formatting of the JSON objects. The resolution of these issues is also described here: https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/

Apart from this, if you are using AWS Glue to crawl these files, make sure the Classification of database table of Data Catalog is not "UNKNOWN".

A K
  • 31
  • 4