1

I need to run a multiple join query with sqldf. I have an original code written in SQL that has to be replicated in R, and so I'm using sqldf. My problem is that among the nested joins, there is a FULL JOIN, and FULL JOIN is still not supported in sqldf . So I'm looking for a workaround.

The nested joins I have are a full join followed by a left join. My solution was to bring the full join out of the query, run it, and with the resulting data set, run the next left join. The original code says:

select
  i1.ID as id1, 
  i2.ID as id2,
  i3.type
  from Blad1 i1 
  full join Blad2 ii3 on i1.ID_pers = ii3.ID_pers_1
  left join Blad1 i2 on ii3.ID_pers_2 = i2.Id_pers 
  left join Blad1 i3 on i1.ID_pers = i3.Id_pers
  where i1.Type = 'OCCUPATION'

and my solution was to run the full join with the dplyr function full_join:

fulljoin <- full_join(Blad1, Blad2, by = c("ID_pers" = "ID_pers_1"))

and then, the query:

B <- sqldf ("select
             i1.ID as id1, 
             i2.ID as id2,
             i3.type
             from fulljoin i1
             left join INDIVI i2 on i1.ID_pers_2 = i2.Id_pers
             left join Blad1 i3 on i1.ID_pers = i3.Id_pers
             where i1.Type = 'OCCUPATION'")

The two data sets are:

Blad1

ID  ID_D      ID_pers   SOURCE           TYPE        VALUE
1   STANDARD    1   PARISH RECORD   ARRIVAL_FROM    Bijmeer
2   STANDARD    2   PARISH RECORD   OCCUPATION      Almere
3   STANDARD    2   PARISH RECORD   ARRIVAL_FROM    WISKUNDE
4   STANDARD    3   PARISH RECORD   OCCUPATION      BILDERDIJK
5   STANDARD    4   PARISH RECORD   ARRIVAL_FROM    Enschede

Blad2

ID      ID_D    ID_pers_1   ID_pers_2   RELATION
933104  STANDARD    4        2021643    Father
1478    STANDARD    4        1          Child
1664    STANDARD    2118979  2021107     Child
6918    STANDARD    2481830  2          Child
7088    STANDARD    2        2011817     Child
1268853 STANDARD    1        3          Mother

The problem is that my solution is not working properly as the original clause after the left join uses table Blad 2

(ii3.ID_pers_2 = i2.Id_pers), 

whereas my solution, for that same clause I can only do it with the Full Join resulting table

(i1.ID_pers_2 = i2.Id_pers)

These two conditions are not the same, so the outcome of both left joins are different, and so the nested queries, and so the final result.

Any tips, welcome.

Thank you for reading.

Paco
  • 65
  • 9
  • 1
    You could use `dplyr::full_join` with `dbplyr` (the latter package translates typical `dplyr` verbs into SQL). – heds1 Aug 13 '19 at 01:09
  • Thank you for your reply. I think what you propose is to do the same I had already done, though -correct me if I'm wrong-, as I was already using that function in my solution (by taking that part out of the query). But the problem with this, as I was commenting when I posted my problem, is that, when I run the next join (left join) -were a condition based on Blad2 and not on the whole resulting table of the full_join is set-, then I get a different final outcome. – Paco Aug 13 '19 at 15:07
  • 1
    You can rewrite a FULL JOIN into a UNION of a LEFT JOIN and a RIGHT JOIN. – wildplasser Aug 14 '19 at 14:28
  • Thank you @wildplasser. The problem I see with that solution is that I must select another field from another table, so when I perform the left and right joins with the clauses affecting i1 and i2, there is a field (Type) that affects another table (i3) that prevents that workaround to be run without error. – Paco Aug 14 '19 at 14:49
  • 1
    .., but for the keyfields you may need `coalesce()` – wildplasser Aug 14 '19 at 14:50
  • Thank you.. The problem is that, due to the multiple joins and crossed references, applying coalesce() doesn't make it work either – Paco Aug 28 '19 at 20:33

1 Answers1

2

sqldf supports 4 backends and the PostgreSQL backend does support full joins.

Modify your code to something like this:

library(RPostgreSQL)
library(sqldf)

sqldf('
  select
    i1."ID" as id1, 
    i2."ID" as id2
  from "Blad1" i1 
    full join "Blad2" ii3 on i1."ID_pers" = ii3."ID_pers_1"
    left join "Blad1" i2 on ii3."ID_pers_2" = i2."ID_pers"
  where i1."TYPE" = \'OCCUPATION\'
')

There is more info on using PostgreSQL with sqldf in ?sqldf and in FAQ 12. How does one use sqldf with PostgreSQL? on the sqldf github page.

An alternative is to simulate full joins with the default. See sqlite backend: http://www.sqlitetutorial.net/sqlite-full-outer-join/ or FULL OUTER JOIN with SQLite

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you very much. The only problem with PostgreSQL is that sadly, we need the code without other installations of new tools (in order to provide it to a net of researchers with a common methodology). RPostgreSQL, requires the effort of intalling conveniently PostgreSQL, which is not trivial... – Paco Aug 14 '19 at 14:12
  • Then use the alternative mentioned in the answer. – G. Grothendieck Aug 14 '19 at 14:16
  • Thank you again. I just edited my post, because my original query has more than 2 joins (full and left) and the selected fields in the query are related to more tables (not only i1 and i2, but also i3), so to use the UNION left joins doesn't seem to be feasible to me (or at least I don't know how to), as it gives an error for the field type of i3, which doesn't exist in the UNION of left joins. That's how I understand the error... If it was only the full and left joins, your suggestion was great. Thank you for your help. – Paco Aug 14 '19 at 14:55