
I have recently started exploring some ETL solutions on AWS and came across AWS Glue. So far it's proving to be a major time sink with no real results, but hopefully that's due to my lack of experience with it.

I have plenty of data in an S3 bucket, all residing in .json files. A single file contains an array of objects. I've created a crawler with a custom JSON classifier (`$[*]`) to go through them. When I run the crawler, the table gets created/updated.
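To illustrate the layout (field names are taken from the table schema below; the values are made up), each file looks roughly like:

```json
[
  {"id": "101", "status": "active", "title": "Example project A", "city": "Austin"},
  {"id": "102", "status": "active", "title": "Example project B", "city": "Reno"}
]
```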

Here's where my problem begins: when trying to query this data through Athena, I get the following issue.

[Screenshot: Athena query results with the entire JSON object repeated in every column]

Athena puts the whole JSON object inside every single column of my table definition. I've searched everywhere and tried different SerDes in the table definition, but nothing seems to help. I've read that Athena requires each object to be on a single line, but I can't control how AWS Glue packs the data into a table.

If anyone has encountered similar issues or knows what this could be, I would be eternally grateful. Thanks!

Edit 1:

As requested in the comments, here's what `SHOW CREATE TABLE` returns for the table:

CREATE EXTERNAL TABLE `calcal`(
  `id` string COMMENT 'from deserializer', 
  `status` string COMMENT 'from deserializer', 
  `active` string COMMENT 'from deserializer', 
  `title` string COMMENT 'from deserializer', 
  `minimum_value` string COMMENT 'from deserializer', 
  `maximum_value` string COMMENT 'from deserializer', 
  `value_operator` string COMMENT 'from deserializer', 
  `value_notes` string COMMENT 'from deserializer', 
  `sector` string COMMENT 'from deserializer', 
  `construction_type` string COMMENT 'from deserializer', 
  `early_planning_and_development_date` string COMMENT 'from deserializer', 
  `bidding_date` string COMMENT 'from deserializer', 
  `pre_construction_date` string COMMENT 'from deserializer', 
  `construction_date` string COMMENT 'from deserializer', 
  `project_schedule` string COMMENT 'from deserializer', 
  `bid_due_date` string COMMENT 'from deserializer', 
  `bidding_information` string COMMENT 'from deserializer', 
  `start_date` string COMMENT 'from deserializer', 
  `start_date_notes` string COMMENT 'from deserializer', 
  `end_date` string COMMENT 'from deserializer', 
  `end_date_notes` string COMMENT 'from deserializer', 
  `address1` string COMMENT 'from deserializer', 
  `address2` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `state` string COMMENT 'from deserializer', 
  `zip` string COMMENT 'from deserializer', 
  `county` string COMMENT 'from deserializer', 
  `other_info` string COMMENT 'from deserializer', 
  `details` string COMMENT 'from deserializer', 
  `update_notes` string COMMENT 'from deserializer', 
  `material_classification` string COMMENT 'from deserializer', 
  `submitted_by` string COMMENT 'from deserializer', 
  `created_at` string COMMENT 'from deserializer', 
  `updated_at` string COMMENT 'from deserializer', 
  `deleted_at` string COMMENT 'from deserializer', 
  `types` array<struct<id:string,name:string,code:string,pivot:struct<project_id:string,project_type_id:string>>> COMMENT 'from deserializer', 
  `user` string COMMENT 'from deserializer', 
  `stage` struct<id:string,name:string> COMMENT 'from deserializer')
PARTITIONED BY ( 
  `partition_0` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='active,address1,address2,bid_due_date,bidding_date,bidding_information,city,construction_date,construction_type,county,created_at,deleted_at,details,early_planning_and_development_date,end_date,end_date_notes,id,material_classification,maximum_value,minimum_value,other_info,pre_construction_date,project_schedule,sector,stage,start_date,start_date_notes,state,status,submitted_by,title,types,update_notes,updated_at,user,value_notes,value_operator,zip') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://dev-smart-dw/Construction/CAL/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='CAL-Test-2', 
  'averageRecordSize'='1261', 
  'classification'='json', 
  'compressionType'='none', 
  'jsonPath'='$[*]', 
  'objectCount'='36', 
  'recordCount'='1800', 
  'sizeKey'='2285975', 
  'typeOfData'='file')

The S3 data is just a bunch of JSON files, each containing a single array of 50 such objects: one array per file, with the files split across multiple folders.
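Since the OpenX JSON SerDe expects one JSON object per line rather than a top-level array, one workaround would be to reflow the files into JSON Lines before crawling. A minimal sketch (local file paths for simplicity; in practice this would read from and write back to S3):

```python
import json

def array_file_to_jsonl(src_path, dst_path):
    """Reflow a file holding a single top-level JSON array into
    JSON Lines (one object per line), the layout the OpenX JSON
    SerDe and Athena handle natively."""
    with open(src_path) as src:
        records = json.load(src)  # expects a top-level JSON array
    with open(dst_path, "w") as dst:
        for record in records:
            # json.dumps emits each object on a single line
            dst.write(json.dumps(record) + "\n")
```

After converting, the crawler no longer needs the `$[*]` classifier at all, since each line is a complete record.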

asked by Vulegend
  • Could you provide some examples from the underlying S3 data? And also run `SHOW CREATE TABLE` against the table created by the Glue crawler? – Philipp Johannis Oct 27 '20 at 12:31
  • @PhilippJohannis Hey, thanks for the reply. I've edited the question and added extra information requested – Vulegend Oct 27 '20 at 14:15
  • The table definition looks good at first glance. You mention that "A single file contains an array of objects." - is it that every field contains one element of the array? – Philipp Johannis Oct 27 '20 at 16:11
  • @PhilippJohannis It's just a .json file that holds an array of these objects. Starts with the [ { object }, { object } ]. Object structure matches what aws glue classifier identified – Vulegend Oct 28 '20 at 14:23
  • Yeah, the JSON SerDe seems to have some issues with arrays of JSON objects and rather recommends having one object per line. Anyway, have a look here; this post addresses the same issue: https://stackoverflow.com/a/50411984/2838867 – Philipp Johannis Oct 28 '20 at 20:03
  • @PhilippJohannis I am already using that as a classifier and it still isn't working. That is the weird part. I can't control how Glue packs the data structure. – Vulegend Oct 29 '20 at 18:28
  • This output is created by an AWS Glue Job? That sounds very strange. But did you see the comment in the answer? `I found that I had to create a new crawler after creating the custom classifier, otherwise Glue would continue to classify as an array. Just another Glue bug I guess` – Philipp Johannis Oct 29 '20 at 18:35
  • @PhilippJohannis I departed from the project for a bit and just came to it. I created a new classifier and it still doesn't work, same exact results – Vulegend Nov 06 '20 at 17:15
  • I actually think that it is not possible - also tried several different approaches. Also in https://docs.aws.amazon.com/athena/latest/ug/parsing-JSON.html it states to "Make sure that each JSON-encoded record is represented on a separate line." – Philipp Johannis Nov 06 '20 at 21:08
  • Are the functions that unnest data and/or split arrays of any use? If you are able to generate multiple rows then you will need to use JSON functions to extract data from these tables. – user 923227 Nov 13 '20 at 05:38

0 Answers