I have a query that runs on Athena (directly) in 43 seconds, scanning 90 GB of data. I then use pyathena to run the same query (in a Jupyter notebook on EMR) and it just doesn't finish (it never returns any results). I have tested it with smaller queries (by adding LIMIT 100) and it works. Is there any reason why pyathena is much slower than running queries directly on Athena?
Viewed 1,725 times
1 Answer
I guess you are using the fetchall API.
The fetchall method calls fetchone in a loop until there are no more results, which is why it takes so long:
def fetchall(self):
    rows = []
    while True:
        row = self.fetchone()
        if row:
            rows.append(row)
        else:
            break
    return rows
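To see why this gets slow on large result sets, here is a toy illustration (the `FakeCursor` class is hypothetical, not part of pyathena): a fetchall built on fetchone pays one call per row, so a million-row result means a million sequential round trips.

```python
class FakeCursor:
    """Hypothetical stand-in that counts fetchone calls."""

    def __init__(self, n_rows):
        self._rows = iter(range(n_rows))
        self.fetchone_calls = 0

    def fetchone(self):
        # Each row costs one call; over the network, one round trip.
        self.fetchone_calls += 1
        return next(self._rows, None)

    def fetchall(self):
        rows = []
        while True:
            row = self.fetchone()
            if row is not None:
                rows.append(row)
            else:
                break
        return rows

cursor = FakeCursor(1000)
rows = cursor.fetchall()
print(len(rows), cursor.fetchone_calls)  # 1000 rows cost 1001 fetchone calls
```

Fetching the CSV that Athena already wrote to S3, as shown below, replaces all of those per-row calls with a single object download.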
You can use cursor.output_location to find the CSV results file Athena creates and read it directly (with boto3, for example):
import boto3

# Note: str.strip('s3://') removes characters, not a prefix, so slice the URI instead
bucket, key = cursor.output_location[len('s3://'):].split('/', 1)
session = boto3.session.Session(aws_access_key_id=YOUR_AWS_KEY,
                                aws_secret_access_key=YOUR_AWS_SECRET,
                                region_name=YOUR_REGION_NAME)
result = session.resource('s3').Object(bucket, key)
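One caveat about parsing the output location: `str.strip('s3://')` strips any of the characters `s`, `3`, `:`, `/` from both ends of the string, not the literal prefix, so bucket names that begin or end with one of those characters get mangled. A minimal sketch of a safe splitter (the helper name `split_s3_uri` and the example bucket are mine, not from pyathena):

```python
def split_s3_uri(uri):
    """Split an 's3://bucket/key' URI into (bucket, key)."""
    if not uri.startswith('s3://'):
        raise ValueError('not an S3 URI: %r' % uri)
    # partition splits on the first '/' only, keeping the rest as the key
    bucket, _, key = uri[len('s3://'):].partition('/')
    return bucket, key

# Correct split:
print(split_s3_uri('s3://stats-bucket/path/out.csv'))
# → ('stats-bucket', 'path/out.csv')

# The strip() approach eats the leading 's' of the bucket name:
print('s3://stats-bucket/path/out.csv'.strip('s3://'))
# → 'tats-bucket/path/out.csv'
```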
In addition, in upcoming versions of pyathena you will be able to read the results into a pandas DataFrame using the as_pandas API, which reads the results file from S3.

Raz Omessi