
For my dissertation data collection, one of the sources is an externally managed system based on a Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework in which I simulate that form. Everything worked well while I limited the size of the resulting dataset, but when I tried to go over 100,000 records (RQ_SIZE in the code below), the tandem "my code - their system" started being unresponsive ("hanging").

So I decided to use the SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping to then combine the paginated results into the target data frame. However, after changing my code accordingly, the only output I see is a single pagination progress character (*) and then nothing more. I'd appreciate help identifying the probable cause of this unexpected behavior. I cannot provide a reproducible example, as it is very difficult to extract the functionality, not to mention the data, but I hope the following code snippet is enough to reveal the issue (or at least a direction toward the problem).

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1

# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {

  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))

  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData()) # retrieve result
  data <- get(dataName)

  # some code

  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}

# merge all the result pages' data frames
data <- do.call("rbind", dfList)

# save current data frame to RDS file
saveRDS(data, rdataFile)
Aleksandr Blekh
  • Hi, it's not clear what the issue is. What specifically is not working? What lets you know it's not working? What results are you getting, and how do they differ from the results you expect? – Ricardo Saporta Jun 18 '14 at 04:23
  • @RicardoSaporta: Specifically, the issue is that the script becomes unresponsive and its execution is in an undetermined (likely hanging) state (as I noted, only the first `*` is printed, while the result contains 1.8M+ records). I will try to reduce the limit for the SQL query (per Arthur's answer below) to see if it helps. I read somewhere that 100K is a borderline value for PostgreSQL pagination (if I understood correctly). I can't use scrollable cursors or anything else beyond plain SQL queries, as I don't have direct access to the system. – Aleksandr Blekh Jun 18 '14 at 04:32

3 Answers


This probably falls into the same category as the well-known problem of MySQL slowing down dramatically at high LIMIT offsets: Why does MYSQL higher LIMIT offset slow the query down?
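
If the queried table exposes an indexed, monotonically increasing key, keyset ("seek") pagination sidesteps the high-OFFSET cost by seeking past the last row seen instead of skipping rows. A minimal sketch against the question's srdaRequestData()/srdaGetData() helpers, assuming a hypothetical id column and that the where argument is interpolated verbatim into the query:

## Sketch only: keyset pagination over a hypothetical indexed `id` column
lastId <- 0
dfList <- list()
repeat {
  where <- paste('id >', lastId, 'ORDER BY id LIMIT', RQ_SIZE)
  srdaRequestData(queryURL, rq$select, rq$from, where, DATA_SEP, ADD_SQL)
  page <- srdaGetData()
  if (NROW(page) == 0) break      # no rows left
  lastId <- max(page$id)          # advance past the last row seen
  dfList <- c(dfList, list(page))
}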

Overall, fetching large data sets over HTTP repeatedly is not very reliable.
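
One common mitigation is to wrap each request in a small retry loop; for example, with base R's tryCatch() (the retry count and back-off below are arbitrary):

## Sketch only: retry a flaky HTTP-backed request with exponential
## back-off before giving up; maxTries and the delay are arbitrary
fetchPage <- function(where, maxTries = 3) {
  for (attempt in seq_len(maxTries)) {
    result <- tryCatch({
      srdaRequestData(queryURL, rq$select, rq$from, where,
                      DATA_SEP, ADD_SQL)
      srdaGetData()
    }, error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(2 ^ attempt)        # wait 2, 4, 8, ... seconds
  }
  stop("request failed after ", maxTries, " attempts")
}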

Arthur
  • Thanks again for the help and the nice link! I have no choice but to retrieve data via HTTP, as that's the only interface provided by that particular system. The positive thing is that, after my data collection framework is fully implemented and tested, I won't need to run the code very frequently, as I cache retrieved data on my side. – Aleksandr Blekh Jun 18 '14 at 04:24
  • 1
    If you have no choice then instead of retrieving the whole dataset limit it to a small subset. If you get it, start increasing. I suspect you are going to hit a limit. And it is probbaly the source internal setting to not to honor long queries (cloud stuff). – Arthur Jun 18 '14 at 13:36

Since this is for your dissertation, here is a hand:

## Folder in which to save the results to disk.
##  Ideally, use a new, empty folder; that makes it easier to load from disk later
folder.out <- "~/mydissertation/sql_data_scrape/"
## Create the folder if not exist. 
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)


## The larger this number, the more memory you will require. 
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10

## The limit size per request
RQ_SIZE <- 1000

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)


## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())

TotalNumberOfRequests <- TotalRows %/% RQ_SIZE

TotalNumberOfGroups <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1

## FYI: Total number of rows being requested is
##  (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups) 


for (g in seq(TotalNumberOfGroups)) {

  ret <- 
    lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {

      ## function(i) is the same code you have
      ##    inside your for loop, but cleaned up.

      # setup SQL pagination
      if (rq$where == '') 
          rq$where <- '1=1'

      ## The offset must advance across groups as well as within them
      offset <- RQ_SIZE * ((g - 1) * NumberOfOffsetsBetweenSaves + (i - 1))
      rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', offset)

      # Submit data request
      srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                      DATA_SEP, ADD_SQL)

       # retrieve result
      data <- srdaGetData()

      # some code

      if (DEBUG) message("*", appendLF = FALSE)    


      ### DONT ASSIGN TO dfList, JUST RETURN `data`
      # xxxxxx DONT DO: xxxxx dfList <- c(dfList, data)
      ### INSTEAD:

      ## return
      data
  })

  ## save each iteration
  file.out <- sprintf("%sdata_scrape_%04i.RDS", folder.out, g)
  saveRDS(do.call(rbind, ret), file=file.out)

  ## OPTIONAL (this will be slower, but will keep your rams and goats in line)
  #    rm(ret)
  #    gc()
}

Then, once you are done scraping:

library(data.table)

folder.out <- "~/mydissertation/sql_data_scrape/"

files <- dir(folder.out, full.names=TRUE, pattern="\\.RDS$")

## Create an empty list
myData <- vector("list", length=length(files))


## Option 1, using data.frame
    for (i in seq_along(myData))
      myData[[i]] <- readRDS(files[[i]])

    DT <- do.call(rbind, myData)

## Option 2, using data.table
    for (i in seq_along(myData))
      myData[[i]] <- as.data.table(readRDS(files[[i]]))

    DT <- rbindlist(myData)
Ricardo Saporta
  • Ricardo, thank you very much for your help! I looked through your suggested code, and it looks like the main essential difference is the *intermediate* step of storing the results in `.rds` files. I'm a bit confused about why this is better than my original approach, especially since the ultimate goal of storing the data collection results means doing file operations twice. I would appreciate it if you could clarify your approach (and what's wrong with storing intermediate results in memory (`dfList`) instead of on the file system). Thank you again! I look forward to hearing from you. – Aleksandr Blekh Jun 18 '14 at 07:07
  • 1
    The real difference in our two codes is the use of `lapply` in mine. The purpose of saving it to disk is simply backup our work. Remember, Jesus Saves. If you feel the data is small enough that this is not necessary, then you can skip this step and just save the data directly to a list, and rbind the whole list – Ricardo Saporta Jun 18 '14 at 15:41
  • 1
    btw, I had forgotten a `do.call(rbind, ret)` in the `saveRDS`. Edited and fixed – Ricardo Saporta Jun 18 '14 at 15:42

I'm answering my own question, as I have finally figured out the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.

The reason is pretty simple: I used rq$where both outside and inside the for loop, so paste() kept concatenating the previous iteration's WHERE clause (with its LIMIT and OFFSET) onto the current one. I fixed the code by processing the contents of the WHERE clause and saving the result before the loop, then using that saved value in each iteration, which makes each iteration's clause independent of the previous ones.
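
Condensed, the fix looks like this (a sketch of the repaired loop; pages are wrapped in list() so that the later do.call(rbind, dfList) stacks whole data frames):

## Save the processed base WHERE clause once, before the loop
baseWhere <- if (rq$where == '') '1=1' else rq$where

dfList <- list()
for (i in 1:numRequests) {
  ## Build each clause from the saved base value, so LIMIT/OFFSET
  ## from previous iterations can never accumulate
  rq$where <- paste(baseWhere, 'LIMIT', RQ_SIZE,
                    'OFFSET', RQ_SIZE * (i - 1))
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  dfList <- c(dfList, list(srdaGetData()))
}
data <- do.call("rbind", dfList)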

This investigation also helped me fix some other deficiencies in my code and make improvements (such as using sub-selects to properly determine the number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. A big thank you to the nice people who helped with this question.
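
For instance, the sub-select trick looks roughly like this (an illustration, assuming srdaRequestData() interpolates its select/from/where arguments verbatim into the query):

## Sketch: wrap the original (possibly aggregating) query in a
## sub-select, so COUNT(*) counts the rows it actually returns:
##   SELECT COUNT(*) FROM (<original query>) AS t
countFrom <- paste0('(SELECT ', rq$select, ' FROM ', rq$from,
                    ' WHERE ', rq$where, ') AS t')
srdaRequestData(queryURL, 'COUNT(*)', countFrom, '1=1',
                DATA_SEP, ADD_SQL)
totalRows <- as.numeric(srdaGetData())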

Aleksandr Blekh