This is a follow-up to the thread Extract rows with duplicate values in two or more fields but different values in another field.
The goal is to do the same thing for bigger-than-RAM data using SQL with DuckDB. The dupKeyEx()
function from that thread works only for data that fits comfortably in RAM.
Here are the essentials from that thread:
library(data.table)
customers <- structure(list(
NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH",
"DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA",
"MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
"DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
"B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY",
"MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA",
"DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA",
"CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK",
"VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR",
"MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
"KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA",
"MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA",
"KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS",
"UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR",
"B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK",
"AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
"PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
"BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977",
"24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969",
"23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960",
"29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964",
"21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962",
"10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962",
"12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958",
"26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
"15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
"24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
"23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979",
"31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
"14-04-1969"),
ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976,
502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967,
502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639,
502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980,
502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877,
502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996,
502979, 502467, 502290),
PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
5e+05, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
500032, 500032)),
.Names = c("NAME", "DOB", "ID", "PIN"),
row.names = c(NA, -50L), class = c("data.table", "data.frame"))
The goal is to extract rows from the above table which have:
- same values in two or more fields (here NAME and DOB); but
- different values in another field (here ID); and
- other columns should be left as they are
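In plain terms, the extraction is: for each (NAME, DOB) key, collect the distinct ID values, and keep only the rows whose key maps to more than one. A minimal sketch of that logic in Python, on toy data and with a hypothetical dup_key_rows() helper (not part of the original code):

```python
from collections import defaultdict

# Toy rows mirroring the structure of the customers table.
rows = [
    {"NAME": "B V RAMANA", "DOB": "15-01-1960", "ID": 502902, "PIN": 500032},
    {"NAME": "B V RAMANA", "DOB": "15-01-1960", "ID": 502910, "PIN": 500033},
    {"NAME": "GEETA SHYAM RAO", "DOB": "13-02-1971", "ID": 502969, "PIN": 500082},
]

def dup_key_rows(rows, dup_cols, unique_col):
    """Return rows whose dup_cols key maps to >1 distinct unique_col value."""
    ids_per_key = defaultdict(set)
    for r in rows:
        key = tuple(r[c] for c in dup_cols)
        ids_per_key[key].add(r[unique_col])
    return [r for r in rows
            if len(ids_per_key[tuple(r[c] for c in dup_cols)]) > 1]

result = dup_key_rows(rows, ("NAME", "DOB"), "ID")
# Keeps the two B V RAMANA rows (same NAME and DOB, different IDs)
# and drops the singleton GEETA SHYAM RAO row.
```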
The working function is:
dupKeyEx <- function(DT, dup_cols, unique_cols) {
  cols <- c(dup_cols, unique_cols)
  # trim whitespace in the character key columns
  chr_cols <- cols[sapply(DT[, ..cols], is.character)]
  DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols = chr_cols]
  # keys (dup_cols) that map to more than one distinct unique_cols value
  mDT <- DT[!duplicated(DT, by = cols), .N, by = dup_cols][N > 1L]
  # pull the matching rows, deduplicated on all key columns
  ans <- unique(DT[mDT[, !"N"], on = dup_cols], by = cols)
  setorderv(ans, c(dup_cols, unique_cols))
  return(ans)
}
which gives the desired result
dupKeyEx(customers, c("NAME", "DOB"), "ID")
NAME DOB ID PIN
1: B ANNAPURNA 12-12-1962 502877 500084
2: B ANNAPURNA 12-12-1962 502878 500084
3: B V RAMANA 15-01-1960 502902 500032
4: B V RAMANA 15-01-1960 502910 500033
5: DARBAR ASHOK 03-01-1950 502548 500084
6: DARBAR ASHOK 03-01-1950 502639 500032
7: KOTLA CHENNAMMA 15-07-1958 502466 500032
8: KOTLA CHENNAMMA 15-07-1958 502467 500032
9: MARIAM SUDHAKAR 24-01-1954 502047 500032
10: MARIAM SUDHAKAR 24-01-1954 502475 500032
11: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
However, this function is not applicable to very large tables that do not fit comfortably into RAM. So I was wondering whether an SQL query could be designed to do the same job for such tables. DuckDB (https://duckdb.org/) is a very efficient open-source analytical database and works very nicely with R. So we try importing the above table into a DuckDB database as follows:
library(DBI)
library(duckdb)
dbFile <- "...path..to..database.../experimental_database.db"
con <- dbConnect(duckdb::duckdb(), dbFile)
# Import the customers table into the database. In the actual scenario, the
# very large bigger-than-RAM table would be imported from a large CSV, or the
# SQL query would be run directly on chunked Parquet files, as described at
# https://duckdb.org/docs/data/parquet. For simplicity, we just use the
# customers table here to compose a working SQL query.
dbWriteTable(con, "customers", customers)
# check if table is successfully imported
dbListTables(con)
# try reading the table
dbReadTable(con, "customers")
Now a functional SQL query needs to be composed that gives the same result as the dupKeyEx()
function above. If that SQL query runs fine and gives valid results on bigger-than-RAM tables, the goal of this thread will be accomplished.
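Here is my untested attempt at such a query: the inner SELECT DISTINCT mimics dupKeyEx()'s deduplication on (NAME, DOB, ID), and the HAVING clause keeps keys with more than one distinct ID. Note one possible difference: dupKeyEx() keeps only the first row per (NAME, DOB, ID), so a plain SELECT DISTINCT over all columns could differ when the same key appears with different PINs. To sanity-check the logic on a toy subset, the sketch below runs the query through Python's built-in sqlite3; I am assuming the same standard SQL carries over to DuckDB via dbGetQuery():

```python
import sqlite3

# Toy subset of customers: one pair sharing (NAME, DOB) with different IDs,
# one exact duplicate row, and one singleton that must be excluded.
rows = [
    ("B V RAMANA", "15-01-1960", 502902, 500032),
    ("B V RAMANA", "15-01-1960", 502910, 500033),
    ("B V RAMANA", "15-01-1960", 502902, 500032),  # exact duplicate
    ("GEETA SHYAM RAO", "13-02-1971", 502969, 500082),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (NAME TEXT, DOB TEXT, ID INTEGER, PIN INTEGER)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

query = """
SELECT DISTINCT c.NAME, c.DOB, c.ID, c.PIN
FROM customers c
JOIN (
    -- (NAME, DOB) keys with more than one distinct ID
    SELECT NAME, DOB
    FROM (SELECT DISTINCT NAME, DOB, ID FROM customers)
    GROUP BY NAME, DOB
    HAVING COUNT(*) > 1
) d ON c.NAME = d.NAME AND c.DOB = d.DOB
ORDER BY c.NAME, c.DOB, c.ID
"""
result = con.execute(query).fetchall()
# Two rows survive: B V RAMANA with IDs 502902 and 502910.
```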
Any help will be appreciated.