7

I run a query from AWS Athena console and takes 10s. The same query run from Sagemaker using PyAthena takes 155s. Is PyAthena slowing it down or is the data transfer from Athena to sagemaker so time consuming?

What could I do to speed this up?

room13
  • 883
  • 2
  • 10
  • 26

1 Answers1

18

Just figure out a way of boosting the queries:

Before I was trying:

import pandas as pd
from pyathena import connect

conn = connect(s3_staging_dir=STAGIN_DIR,
             region_name=REGION)
pd.read_sql(QUERY, conn)
# takes 160s

Figured out that using a PandasCursor instead of a connection is way faster

import pandas as pd
pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(s3_staging_dir=STAGIN_DIR,
                 region_name=REGION,
                 cursor_class=PandasCursor).cursor()
df = cursor.execute(QUERY).as_pandas()
# takes 12s

Ref: https://github.com/laughingman7743/PyAthena/issues/46

room13
  • 883
  • 2
  • 10
  • 26
  • 1
    It seems like the time comments are reversed. The top should take longer i.e. 160s and the bottom takes 12s. – Neelotpal Shukla Dec 02 '20 at 19:11
  • 1
    it's `pyathena.pandas.cursor` and not `pyathena.pandas_underscore_cursor` . Just a minor typo i understand but could save some potential searches along the way :-) @room13 @trudolf – Ram K Jul 29 '21 at 20:42
  • you are right @RameshK unfortunately i cannot fix it. it says edits have to be at least 6 characters ^^ – trudolf Jul 30 '21 at 04:39