I have a query that runs on Athena (directly) in 43 seconds, scanning 90 GB of data. I then use pyathena to run the same query (in a Jupyter notebook on EMR) and it just doesn't finish (it never returns any results). I have tested it with smaller queries (by adding LIMIT 100) and it works. Is there any reason why pyathena is much slower than running queries directly on Athena?
Viewed 1,725 times
1 Answer
I guess you are using the fetchall API.
The fetchall method calls fetchone in a loop until there are no more results, which is why it takes so long:
def fetchall(self):
    rows = []
    while True:
        row = self.fetchone()
        if row:
            rows.append(row)
        else:
            break
    return rows
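To see why this gets slow on large result sets, here is a toy illustration (the `FakeCursor` class is hypothetical, not part of pyathena): a fetchall built on fetchone pays one call per row, so a million-row result means a million sequential round trips.

```python
class FakeCursor:
    """Hypothetical stand-in that counts fetchone calls."""

    def __init__(self, n_rows):
        self._rows = iter(range(n_rows))
        self.fetchone_calls = 0

    def fetchone(self):
        # Each row costs one call; over the network, one round trip.
        self.fetchone_calls += 1
        return next(self._rows, None)

    def fetchall(self):
        rows = []
        while True:
            row = self.fetchone()
            if row is not None:
                rows.append(row)
            else:
                break
        return rows

cursor = FakeCursor(1000)
rows = cursor.fetchall()
print(len(rows), cursor.fetchone_calls)  # 1000 rows cost 1001 fetchone calls
```

Fetching the CSV that Athena already wrote to S3, as shown below, replaces all of those per-row calls with a single object download.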
You can use cursor.output_location to find the CSV results file Athena creates and read it directly (with boto3, for example):
import boto3

# Note: str.strip('s3://') removes characters, not a prefix, so slice the URI instead
bucket, key = cursor.output_location[len('s3://'):].split('/', 1)
session = boto3.session.Session(aws_access_key_id=YOUR_AWS_KEY,
                                aws_secret_access_key=YOUR_AWS_SECRET,
                                region_name=YOUR_REGION_NAME)
result = session.resource('s3').Object(bucket, key)
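One caveat about parsing the output location: `str.strip('s3://')` strips any of the characters `s`, `3`, `:`, `/` from both ends of the string, not the literal prefix, so bucket names that begin or end with one of those characters get mangled. A minimal sketch of a safe splitter (the helper name `split_s3_uri` and the example bucket are mine, not from pyathena):

```python
def split_s3_uri(uri):
    """Split an 's3://bucket/key' URI into (bucket, key)."""
    if not uri.startswith('s3://'):
        raise ValueError('not an S3 URI: %r' % uri)
    # partition splits on the first '/' only, keeping the rest as the key
    bucket, _, key = uri[len('s3://'):].partition('/')
    return bucket, key

# Correct split:
print(split_s3_uri('s3://stats-bucket/path/out.csv'))
# → ('stats-bucket', 'path/out.csv')

# The strip() approach eats the leading 's' of the bucket name:
print('s3://stats-bucket/path/out.csv'.strip('s3://'))
# → 'tats-bucket/path/out.csv'
```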
In addition, in upcoming versions of pyathena you will be able to read the results into a pandas DataFrame using the as_pandas API, which reads the results file from S3.

Raz Omessi