0

I'm relatively new to R and SQL and I'm stuck with a query, hoping someone can answer my question.

I'm using the RODBC package and I need to do a rather sophisticated query.

I have an R dataframe that looks like this :

donor_id <- c(1,2)
date <- c(as.Date('2011-01-01'), as.Date('2013-06-14'))
df <- cbind(donor_id, date)

And a big database that has donation dates for every of my donor_id.

I want to add a column to my data frame from the database. This column will be the number of donations that each of my donor_id has made before the date in my data frame.

So according to this post, there are 3 solutions :

  • Query the whole database and work from there (takes forever)
  • Save the data frame into the server and join them in a query (I don't have sufficient rights for this)
  • Pass the conditions from the data frame, which is what I'm trying to do

I have drawn ideas from this post for the general MO and this post for the sum of rows.

The general idea is to merge my data frame and the query

merge(df,query, by='donor_id')

The query looks like this :

paste("select donor_id, SUM(CASE WHEN donation_date <= ",df$date," 
THEN collection_count ELSE 0 END) AS 'ncol' from fact_collection 
GROUP BY donor_id"))

However, when I'm trying this, I get the following error message :

Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

Which means that my merge cannot recognize the donor_id for some reason.

Agaz Wani
  • 5,514
  • 8
  • 42
  • 62
Negarev
  • 79
  • 2
  • 8
  • 2
    can you do `names(df)` and `names(query)`? Do they both have only single `"donor_id"` column? Also need to check the whether any additional punctuation,spaces or capital letters for the column name. – addicted Nov 17 '17 at 03:59
  • df does have only one donor_id column, however query has no column. Executing query produces an error because the argument df$date is not recognized as a valid condition – Negarev Nov 19 '17 at 20:11
  • @addicted But `query` cannot be executed on its own because it needs to select the correct row with `df$date`. The correct row is selected with the `merge` command – Negarev Nov 20 '17 at 01:07
  • you know you cannot merge a dataframe `(df)` with query string `query`, right? `merge` works for joining 2 dataframe, not a dataframe and a query. Your `query <- paste("...", df$Date, "...")` will result in a vector of 2 strings, not dataframe. – addicted Nov 20 '17 at 02:06
  • Yes, I know. my code is actually `merge(df,sqlQuery(connection,query), by=donor_id)` – Negarev Nov 20 '17 at 22:42

1 Answers1

0

Assuming your df has more than one row, you cannot use sqlQuery by itself to execute a series of queries contained in an R character vector. As ?sqlQuery states, argument query must be a valid SQL statement. The output of your paste statement would look something like this:

[1] "select donor_id, SUM(CASE WHEN donation_date <= 2016-01-01
THEN collection_count ELSE 0 END) AS 'ncol' from fact_collection 
GROUP BY donor_id"
[2] "select donor_id, SUM(CASE WHEN donation_date <= 2016-01-02
THEN collection_count ELSE 0 END) AS 'ncol' from fact_collection 
GROUP BY donor_id"
[3] "select donor_id, SUM(CASE WHEN donation_date <= 2016-01-03
THEN collection_count ELSE 0 END) AS 'ncol' from fact_collection 
GROUP BY donor_id"

etc.

Instead of trying to run a vector of queries in a single sqlQuery statement, consider using a function from the apply family, then using rbind to aggregate the data frames into one, e.g.:

all_results <- apply(df$date,1,function(value){
query <- paste("select donor_id, SUM(CASE WHEN donation_date <= ",value," 
THEN collection_count ELSE 0 END) AS 'ncol' from fact_collection 
GROUP BY donor_id"))
result <- sqlQuery(connection,query)
return(result)
}) # returns a list of results in separate data frames

results_df <- do.call(rbind, all_results) # combine list of dataframes into a monolith
John D
  • 467
  • 1
  • 5
  • 10