1

I have a list of dataframes, which I would like to perform a merge on. How can I pass each element of the list to the respective sqldf-statement? My code is as follows:

require(sqldf)
a <- data.frame(ID=c("a","b","c","d","a","a"),num=1:6,num2=2:7)
b <- data.frame(ID=c("d","a","a","a","b","c"),num=6:1,num2=7:2)
Datalist <- list(a,b)
mergeto <- data.frame(ID=c("a","a","a","b","c","d"),name=rep("A",6),name2=rep("B"),6)
test <- sqldf("
select *
from Datalist[[1]] as a left outer join mergeto as b
where a.ID = b.ID and
")

I get the error:

Error in rsqlite_send_query(conn@ptr, statement) : unrecognized token: "]"

Is there any way to use the list variables directly from R? I mean I know I could use a <- Datalist[[1]] and then use a in the sqldf-statement, but this way a create a copy of Datalist elements every time and Datalist is already very large..I am sure there must be a way. Also to access variables directly that contain strings, there must be a way? So what I want to do in the end is perform SQL-statements in a loop, so I must also be able to change the index within the statement. But because it's a string, it doesn't seem easily possible anymore.

EDC
  • 613
  • 2
  • 7
  • 16
  • Do you really need to use `sqldf`? Merging a list of data frames is pretty well covered [in this R-FAQ](https://stackoverflow.com/q/8091303/903061). `base::merge`, `data.table::merge`, and `dplyr::left_join` are all fully capable of doing a left join. – Gregor Thomas Oct 12 '17 at 21:16
  • yes I need, because I have some complicated merge conditions. I don't think I can use base merge for that, at least it wasn't possible when I checked about 1 year ago. If base merge can do everything SQL can, then yes, I would switch immediately. – EDC Oct 12 '17 at 21:17
  • `data.table` can do non-equi joins, though not using it's `merge` functionality (AFAIK). I think your best bet using SQL df is to (1) name your list, (2) use `list2env` to put your data in an environment, (3) specify that environment using the `envir` argument of SQL df, and (4) `paste` together the sql statements with the names of your lists. You might want to ask another question with an example of your complicated conditions to see if it is possible with a list of `data.table`s and their advanced features. – Gregor Thomas Oct 12 '17 at 21:23
  • @Gregor thanks for your advice. Maybe I will do that as well, but for now I would like to see a solution in sqldf, if it is possible. I added a MWE on top of my question. – EDC Oct 12 '17 at 21:30
  • 1
    Can you back up and provide your full X problem and not your proposed Y solution (i.e., [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem))? You mention a loop and *complicated merge conditions* but we do not see either here. – Parfait Oct 12 '17 at 22:02
  • @Parfait yeah, the reason I didn't provide it is that I don't think it changes anything if you would use a loop here, e.g. looping with an testindex of 1 would be the same problem as stated above. Unfortunately I cannot provide the datasets I am using to simulate the merge conditions, but in this regard I also think it's not relevant as much for handling `sqldf`, because I can actually do the merge successfully for one dataset. So this would only be relevant for a datatable solution I believe. I might make a new thread in this regard if needed. But thanks for your interest/suggestions. – EDC Oct 12 '17 at 22:06
  • 1
    Once again, you are not explaining what you are doing overall. Your focus is on packages. If you need to merge multiple dataframes in a list use the [chain merge with `Reduce()`](https://stackoverflow.com/questions/13778267/how-can-i-merge-multiple-dataframes-with-the-same-column-names). – Parfait Oct 13 '17 at 00:59
  • Ok what I want to do overall is to to merge several dataframes to one target dataframe. I have the dataframes to be merged to the target in a list. I would like to use sqldf to perform the merges. What you link is similar to what I want, but in my case there are additional conditions which I did not mention here because I thought it was not necessary. For example, one condition is the SQL statement 'between x and y' where x and y are dates. – EDC Oct 13 '17 at 02:43

1 Answers1

3

Try the following which was already outlined by @Gregor in a comment but here we make it explicit. We use a named list L consisting of three data frames each one of which equals the built in BOD data frame and join the second and subsequent data frames to the first on Time constructing the text of the SQL statement sql from its two parts: sel_from (a character string containing the select and from) and joins (a character vector of join definitions).

library(sqldf)

L <- list(BOD1 = BOD, BOD2 = BOD, BOD3 = BOD) # named list as input
nms <- names(L)
sel_from <- sprintf("select * from %s", nms[1])
joins <- sprintf("join %s on %s.Time = %s.Time", nms[-1], nms[-1], nms[1])
sql <- paste(c(sel_from, joins), collapse = "\n")
sqldf(sql, envir = list2env(L))

This gives the following output:

  Time demand Time demand Time demand
1    1    8.3    1    8.3    1    8.3
2    2   10.3    2   10.3    2   10.3
3    3   19.0    3   19.0    3   19.0
4    4   16.0    4   16.0    4   16.0
5    5   15.6    5   15.6    5   15.6
6    7   19.8    7   19.8    7   19.8

Also we can look at the contents of sql like this:

cat(sql, "\n")

giving:

select * from BOD1
join BOD2 on BOD2.Time = BOD1.Time
join BOD3 on BOD3.Time = BOD1.Time 
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • cool! Now i need to say more to make the comment long enough ..so .. great! I'm doing exactly in the OP: list of dataframes and running sql against them. – WestCoastProjects Jun 25 '18 at 02:57