
I have been trying to get data from Athena into R through the paws package. So far I have successfully gotten the query to run and return some results, but I only get the default maximum of 1,000 rows. I've seen some solutions out there for the Boto3 library in Python, but even though the syntax is similar, I can't call a pagination function like the one Boto3 has.

Does anyone know how to do pagination, or how to use the NextToken argument of the function?

Here is what my code looks like:

Installing the SDK for connecting to AWS services:

install.packages('paws')
install.packages('tidyverse')

Creating an Athena object:

athena <- paws::athena()
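
By default the client picks up credentials from the usual environment variables or profiles. paws clients also accept an optional config list; a minimal sketch, where the profile name and region below are placeholders to adjust to your own setup:

athena <- paws::athena(
  config = list(
    credentials = list(profile = "my-profile"),  # placeholder profile name
    region = "eu-west-1"                         # placeholder region
  )
)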

Writing the query and specifying the output location:

query <- athena$start_query_execution(QueryString = '
                                                    SELECT *
                                                    FROM db.table
                                                    LIMIT 100000
                                                    ',
                                       ResultConfiguration = list(OutputLocation =
                                                                  "s3://aws-athena..."
                                                                  )
                                      )

Checking the query execution:

result <- athena$get_query_execution(QueryExecutionId = query$QueryExecutionId)
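
Note that start_query_execution() is asynchronous, so the query may still be running at this point. A minimal polling sketch (the one-second sleep is an arbitrary choice) that waits for a terminal state before fetching results:

repeat {
  result <- athena$get_query_execution(QueryExecutionId = query$QueryExecutionId)
  state <- result$QueryExecution$Status$State
  if (state %in% c("SUCCEEDED", "FAILED", "CANCELLED")) break
  Sys.sleep(1)  # wait before polling again
}
if (state != "SUCCEEDED") stop("Query ended in state: ", state)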

Getting the query output:

output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId)
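
get_query_results() returns at most 1,000 rows per call, which is why the result is capped; the page size can be made explicit with MaxResults, but larger result sets still have to be paged with NextToken, as the answers below show:

output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId,
                                   MaxResults = 1000L)  # 1000 is the per-call ceiling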

Parsing the output into a table object:

data <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),
                                  nrow = length(unlist(output$ResultSet$Rows[1])))))

colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]
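
As a side note, the matrix/unlist trick above can silently misalign columns if any cell comes back without a VarCharValue. A more explicit parse of the same ResultSet structure, sketched here under the assumption that every row carries a Data list of cells:

rows <- output$ResultSet$Rows
parsed <- lapply(rows, function(row) {
  vapply(row$Data, function(cell) {
    # missing cells have no VarCharValue; keep them as NA instead of dropping them
    if (length(cell$VarCharValue) == 1) cell$VarCharValue else NA_character_
  }, character(1))
})
data <- dplyr::as_tibble(as.data.frame(do.call(rbind, parsed),
                                       stringsAsFactors = FALSE))
colnames(data) <- as.character(unlist(data[1, ]))  # first row holds the column names
data <- data[-1, ]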

2 Answers


This is late, but it does answer the original post. You can use the NextToken attribute from get_query_results() to get all of the results from your query. I have not benchmarked it, but in simple examples it has been faster, since it does not establish a connection to the whole Athena database the way dbConnect() with RAthena::athena() or noctua::athena() does. The following loop will get all of your query results into your tibble:

# starting with the OP results processing 
output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId)

data <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),
                                  nrow = length(unlist(output$ResultSet$Rows[1])))))

colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]

# adding this loop will get the rest of your query results
while(length(output$NextToken) == 1) {
  output <- athena$get_query_results(QueryExecutionId = query$QueryExecutionId,
                                     NextToken = output$NextToken)
  tmp <- dplyr::as_tibble(t(matrix(unlist(output$ResultSet$Rows),
                                   nrow = length(unlist(output$ResultSet$Rows[1])))))
  colnames(tmp) <- colnames(data)
  data <- rbind(data, tmp)
}
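
The same loop can be wrapped into a small helper so the paging is reusable; fetch_all_results() below is just a name made up for this sketch, not part of paws:

fetch_all_results <- function(athena, query_execution_id) {
  parse_page <- function(page) {
    dplyr::as_tibble(t(matrix(unlist(page$ResultSet$Rows),
                              nrow = length(unlist(page$ResultSet$Rows[1])))))
  }
  page <- athena$get_query_results(QueryExecutionId = query_execution_id)
  data <- parse_page(page)
  colnames(data) <- as.character(unlist(data[1, ]))  # first page starts with the header row
  data <- data[-1, ]
  # keep fetching pages until no NextToken is returned
  while (length(page$NextToken) == 1) {
    page <- athena$get_query_results(QueryExecutionId = query_execution_id,
                                     NextToken = page$NextToken)
    tmp <- parse_page(page)
    colnames(tmp) <- colnames(data)
    data <- rbind(data, tmp)
  }
  data
}

data <- fetch_all_results(athena, query$QueryExecutionId)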

You might want to consider the noctua package. This package connects R to Athena using the paws SDK through a DBI interface. It solves the 1,000-row limitation you are running into. Your query above would look like:

library(DBI)

con = dbConnect(noctua::athena(), s3_staging_dir = "s3://aws-athena...")

dbGetQuery(con, "SELECT * FROM db.table LIMIT 100000")

The package also offers integration into dplyr:

library(DBI)
library(dplyr)

con = dbConnect(noctua::athena(), 
                schema_name = "db", 
                s3_staging_dir = "s3://aws-athena...")

tbl(con, "table")
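
Because the connection is a DBI/dbplyr backend, dplyr verbs are translated to SQL and only executed on Athena when you collect; the column name below is a placeholder:

tbl(con, "table") %>%
  filter(some_column > 0) %>%  # placeholder column
  collect()                    # runs the query on Athena and pulls the rows into R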
  • Thanks, that did the trick! I do have one reservation: it works fine with a couple of hundred thousand rows, but when I try 1 million or above I get a throttling exception. I guess some background is needed: I've been trying to run my code in AWS SageMaker. So to get more data in, should I create a bigger instance, or is it something that needs to be done on the account administration side of things? Any help with this would really be appreciated. – xricky Jan 16 '20 at 16:42
  • My initial guess would be to double check how big the SageMaker instance is, as it might have maxed out your RAM. If that isn't it, can you raise a ticket at https://github.com/DyfanJones/noctua/issues? That will help identify whether it is an issue with the paws SDK when transferring large data. Alternatively you can try RAthena, which does exactly the same as noctua but uses boto3 for the connection instead. Finally, if it is a RAM issue, please raise a feature request for larger-than-memory queries; I am the author of noctua and happy to look into developing further improvements for the package. – Dyfan Jones Jan 16 '20 at 17:34
  • Awesome, I will look into this and put in a request if needed. Thanks for the help! – xricky Jan 16 '20 at 17:39