This seems like it should be simple, but as a complete beginner to SQL I've spent a lot of time trying to figure this out without any luck.
Say I have a dataframe in R like this:
df1 <- data.frame(value = c(1, 2, 3, 4),
                  ID = c("foo", "bar", "baz", "waffle"))
# value ID
# 1 1 foo
# 2 2 bar
# 3 3 baz
# 4 4 waffle
I can load it into a SQLite database easily:
table_name <- "mytable"
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, table_name, df1, row.names = FALSE)
dbListTables(con)
# [1] "mytable"
Now say I create another dataframe that has a mix of:

- entries which are exactly the same as in the first df
- entries whose IDs are present in the first df but whose values have changed
- entries which are not present in the first df at all
df2 <- data.frame(value = c(1, 3, 5, 6),
                  ID = c("foo", "bar", "abc", "zzz"))
# value ID
# 1 1 foo # present in df1
# 2 3 bar # updated from df1
# 3 5 abc # not in df1
# 4 6 zzz # not in df1
Now I want to update my SQLite table with this new dataframe:

- existing entries which are exactly the same should be skipped
- existing entries whose values differ should be updated
- new entries should be appended
My best guess is that the code required would be structured like this:
if(!dbExistsTable(con, table_name)){
  # the table doesn't exist yet: just write df2 to it
  dbWriteTable(con, table_name, df2, row.names = FALSE)
} else {
  # update the entries in the table with the entries in df2
  for(i in seq(nrow(df2))){
    irow <- df2[i, ]
    # check if irow is already in the table:
    # - if it's there but different, update the table's entry
    # - if it's not there, append irow to the table
    # - if it's already present and identical, skip to the next row
  }
}
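To make those comment placeholders concrete, here is my rough attempt at the loop body, with ID assumed to be the key column. I pieced the params usage together from the DBI docs, so I'm not sure it's the right approach:

existing <- dbGetQuery(con, "SELECT value FROM mytable WHERE ID = ?",
                       params = list(as.character(irow$ID)))
if(nrow(existing) == 0){
  # ID not in the table yet: append the row
  dbWriteTable(con, table_name, irow, append = TRUE, row.names = FALSE)
} else if(existing$value[1] != irow$value){
  # ID present but the value differs: update it in place
  dbExecute(con, "UPDATE mytable SET value = ? WHERE ID = ?",
            params = list(irow$value, as.character(irow$ID)))
}
# otherwise the row is already present and identical, so do nothing

(The as.character() calls are there because data.frame() may have stored ID as a factor.)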
But what code would actually check whether a row (irow) is already present in the SQLite table, and then update it? It seems like it might involve dbBind(), but I haven't been able to find a working example of how to use it for this purpose, and the docs are not clear. And is this kind of row-by-row method going to be efficient for millions of entries and an arbitrary number of columns?
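For what it's worth, my other thought was to skip the R loop entirely and do a set-based upsert with SQLite's INSERT ... ON CONFLICT syntax, writing df2 to a staging table first. This is only a guess: it assumes mytable was created with ID as a primary (or unique) key, which dbWriteTable() doesn't do on its own, and it needs SQLite >= 3.24.0:

dbWriteTable(con, "staging", df2, row.names = FALSE, overwrite = TRUE)
dbExecute(con, "
  INSERT INTO mytable (value, ID)
  SELECT value, ID FROM staging WHERE true
  -- 'WHERE true' works around a parsing ambiguity in the upsert syntax
  ON CONFLICT (ID) DO UPDATE SET value = excluded.value
")

Would something like that scale better than the loop, or is dbBind() the intended tool here?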