I have recently started exploring some ETL solutions on AWS and came across AWS Glue. Thus far it's proving to be a major time sink for no real results, but hopefully that's just down to my lack of experience with it.
I have plenty of data in an S3 bucket, all residing in .json files. A single file contains an array of objects. I've created a crawler to go through them with a JSON classifier ($[*]). I run the crawler and the table gets created/updated.
Here's where my problem begins: when I try to query this data through Athena, I run into the following issue.
Athena puts the whole JSON object inside every single column of my table definition. I've searched everywhere and tried different SerDes in the table definition, but nothing seems to help. I've been reading that Athena requires each object to be on a single line, but I can't control how AWS Glue packs the data into a table.
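To illustrate what I mean by the single-line requirement, here's a minimal local sketch of the conversion I think the JSON SerDe expects, with made-up file names (this isn't something I've managed to get Glue to do for me):

import json

# One of the S3 files, downloaded locally: a single JSON array of ~50 objects
with open("cal_0001.json") as src:
    records = json.load(src)

# Rewrite it as newline-delimited JSON: one compact object per line
with open("cal_0001.ndjson", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")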
If anyone has encountered similar issues or knows what this could be, I would be eternally grateful. Thanks!
Edit 1:
As requested in the comments, here's what SHOW CREATE TABLE looks like for the table:
CREATE EXTERNAL TABLE `calcal`(
`id` string COMMENT 'from deserializer',
`status` string COMMENT 'from deserializer',
`active` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`minimum_value` string COMMENT 'from deserializer',
`maximum_value` string COMMENT 'from deserializer',
`value_operator` string COMMENT 'from deserializer',
`value_notes` string COMMENT 'from deserializer',
`sector` string COMMENT 'from deserializer',
`construction_type` string COMMENT 'from deserializer',
`early_planning_and_development_date` string COMMENT 'from deserializer',
`bidding_date` string COMMENT 'from deserializer',
`pre_construction_date` string COMMENT 'from deserializer',
`construction_date` string COMMENT 'from deserializer',
`project_schedule` string COMMENT 'from deserializer',
`bid_due_date` string COMMENT 'from deserializer',
`bidding_information` string COMMENT 'from deserializer',
`start_date` string COMMENT 'from deserializer',
`start_date_notes` string COMMENT 'from deserializer',
`end_date` string COMMENT 'from deserializer',
`end_date_notes` string COMMENT 'from deserializer',
`address1` string COMMENT 'from deserializer',
`address2` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`state` string COMMENT 'from deserializer',
`zip` string COMMENT 'from deserializer',
`county` string COMMENT 'from deserializer',
`other_info` string COMMENT 'from deserializer',
`details` string COMMENT 'from deserializer',
`update_notes` string COMMENT 'from deserializer',
`material_classification` string COMMENT 'from deserializer',
`submitted_by` string COMMENT 'from deserializer',
`created_at` string COMMENT 'from deserializer',
`updated_at` string COMMENT 'from deserializer',
`deleted_at` string COMMENT 'from deserializer',
`types` array<struct<id:string,name:string,code:string,pivot:struct<project_id:string,project_type_id:string>>> COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`stage` struct<id:string,name:string> COMMENT 'from deserializer')
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='active,address1,address2,bid_due_date,bidding_date,bidding_information,city,construction_date,construction_type,county,created_at,deleted_at,details,early_planning_and_development_date,end_date,end_date_notes,id,material_classification,maximum_value,minimum_value,other_info,pre_construction_date,project_schedule,sector,stage,start_date,start_date_notes,state,status,submitted_by,title,types,update_notes,updated_at,user,value_notes,value_operator,zip')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://dev-smart-dw/Construction/CAL/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='CAL-Test-2',
'averageRecordSize'='1261',
'classification'='json',
'compressionType'='none',
'jsonPath'='$[*]',
'objectCount'='36',
'recordCount'='1800',
'sizeKey'='2285975',
'typeOfData'='file')
The S3 data is just a bunch of JSON files, each containing a single array with 50 of these objects inside: one array per file, with multiple files split across multiple folders.
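If it helps frame the question, this is the sort of pre-processing I suspect is needed so that each record ends up on its own line, written as a rough PySpark sketch (the output path is just a placeholder, and I haven't confirmed this is actually the right fix):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each source file holds one multi-line JSON array, so read with multiLine enabled
df = (spark.read
      .option("multiLine", "true")
      .json("s3://dev-smart-dw/Construction/CAL/"))

# Spark writes JSON output as one object per line (newline-delimited)
(df.write
   .mode("overwrite")
   .json("s3://dev-smart-dw/Construction/CAL-ndjson/"))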