I need to process some data of a certain flow that I have in a specific folder in a bucket S3
. I want to do this in Python
. After searching for a while I found the library PyAthena
which exactly what I was looking for!
I installed the version 1.8.0
of PyAthena
.
For your information, I have my S3
bucket in the region of Paris eu-west-3
and my Athena
database in the region of Francfort eu-central-1
.
I used the following code that I found in the documentation PyAthena Doc:
from pyathena import connect
cursor = connect(aws_access_key_id='YOUR_ACCESS_KEY_ID',
aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2').cursor()
cursor.execute("SELECT * FROM one_row")
print(cursor.description)
print(cursor.fetchall())
I was at first not sure about which region_name
to use, if it should be the one of Paris
where the S3
bucket is, or the one of Francfort
where the Athena
database is!!
I tried both and following the error messages I got, I ended up using the one of my S3
bucket!
However I kept getting errors of permissions about Glue
, something like:
pyathena.error.OperationalError: Insufficient permissions to execute the query. Error retrieving table : master in database : default due to : User: arn:aws:iam::<my-account-client-ID>:user/s3-test is not authorized to perform: glue:GetTable on resource: arn:aws:glue:eu-west-3:<my-account-client-ID>:catalog
So I added the following strategy in the IAM
:
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:*",
"athena:ListQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:StopQueryExecution",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution"
],
"Resource": "*"
},
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": [
"glue:GetTable",
"glue:GetTables",
"glue:GetDatabase"
],
"Resource": [
"arn:aws:glue:eu-west-3:<my-account-client-ID>:catalog",
"arn:aws:glue:eu-west-3:<my-account-client-ID>:database/*",
"arn:aws:glue:eu-west-3:<my-account-client-ID>:table/*/*"
]
}
Now I have this error message:
cursor.execute("select * from master")
File "/home/ubuntu/.local/lib/python3.6/site-packages/pyathena/util.py", line 28, in _wrapper
return wrapped(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pyathena/cursor.py", line 57, in execute
raise OperationalError(query_execution.state_change_reason)
pyathena.error.OperationalError: SYNTAX_ERROR: line 1:15: Schema default does not exist