-2

I have a parquet file with the following format

id = 2a1ed0848022
raw_value:
[{"state":"MO","city":"O Fallon","location_name":"Jackson Hewitt Tax Service","top_category":"Accounting, Tax Preparation, Bookkeeping, and Payroll Services"},
{"state":"IL","city":"Collinsville","location_name":"L E Smith Jewelry","top_category":"Jewelry, Luggage, and Leather Goods Stores"},
{"state":"MO","city":"O Fallon","location_name":"Bagwasi Family Eyecare","top_category":"Health and Personal Care Stores"},
{"state":"MO","city":"O Fallon","location_name":"Rally's Drive-In Restaurants","top_category":"Restaurants and Other Eating Places"},
{"state":"IL","city":"Collinsville","location_name":"BP","top_category":"Gasoline Stations"}

I would like to create a table in Athena on this parquet file and run a query like this

select maid from test12 where state="MD" and city="Baltimore".

How can I search state and city from the second column which has nested JSON.

blackbishop
  • 30,945
  • 11
  • 55
  • 76
Himateja Madala
  • 321
  • 1
  • 4
  • 16

1 Answers1

0

The key to this is to use UNNEST. I'm assuming that raw_value is typed as array<struct<state:string,city:string,location_name:string,top_category:string>>.

SELECT id
FROM the_table CROSS JOIN UNNEST(raw_value) rv (location)
WHERE location.state = 'MD' AND city = 'Baltimore'

Using UNNEST like this expands each row in the table to one row per element in the array.

If raw_value is a string column, you need to parse it first. You can find an example of this in this answer: https://stackoverflow.com/a/56176204/1109

Theo
  • 131,503
  • 21
  • 160
  • 205