
This is a follow-up to the thread Extract rows with duplicate values in two or more fields but different values in another field. The goal is to do the same thing for bigger-than-RAM data with SQL, using DuckDB. The dupKeyEx() function from that thread works only for data that fits comfortably in RAM.

Here are the essentials from that thread:

library(data.table)

customers <- structure(list(
  NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH", 
           "DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA", 
           "MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM", 
           "DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
           "B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY", 
           "MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA", 
           "DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA", 
           "CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK", 
           "VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR", 
           "MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
           "KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA", 
           "MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA", 
           "KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS", 
           "UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR", 
           "B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK", 
           "AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
           "PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
           "BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
  DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977", 
          "24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969", 
          "23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960", 
          "29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964", 
          "21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962", 
          "10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962", 
          "12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958", 
          "26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
          "15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
          "24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
          "23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979", 
          "31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
          "14-04-1969"),
  ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976, 
         502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967, 
         502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639, 
         502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980, 
         502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877, 
         502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996, 
         502979, 502467, 502290), 
  PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
          500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
          500000, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
          500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
          500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
          500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
          500032, 500032)),
  .Names = c("NAME", "DOB", "ID", "PIN"),
  row.names = c(NA, -50L), class = c("data.table", "data.frame"))

The goal is to extract the rows from the above table that have:

  • the same values in two or more fields (here NAME and DOB); but
  • different values in another field (here ID);

with the other columns left as they are.

The working function is:

dupKeyEx <- function(DT, dup_cols, unique_cols) {
  cols <- c(dup_cols, unique_cols)
  # trim whitespace in the character key columns
  chr_cols <- cols[sapply(DT[, ..cols], is.character)]
  DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols = chr_cols]
  # keys (dup_cols) that occur with more than one distinct unique_cols value
  mDT <- DT[!duplicated(DT, by = cols), .N, by = dup_cols][N > 1L]
  # join back to fetch the full rows, de-duplicated on all key columns
  ans <- unique(DT[mDT[, !"N"], on = dup_cols], by = cols)
  setorderv(ans, c(dup_cols, unique_cols))
  return(ans)
}

which gives the desired result:

dupKeyEx(customers, c("NAME", "DOB"), "ID")
    
                          NAME        DOB     ID    PIN
 1:                B ANNAPURNA 12-12-1962 502877 500084
 2:                B ANNAPURNA 12-12-1962 502878 500084
 3:                 B V RAMANA 15-01-1960 502902 500032
 4:                 B V RAMANA 15-01-1960 502910 500033
 5:               DARBAR ASHOK 03-01-1950 502548 500084
 6:               DARBAR ASHOK 03-01-1950 502639 500032
 7:            KOTLA CHENNAMMA 15-07-1958 502466 500032
 8:            KOTLA CHENNAMMA 15-07-1958 502467 500032
 9:            MARIAM SUDHAKAR 24-01-1954 502047 500032
10:            MARIAM SUDHAKAR 24-01-1954 502475 500032
11:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032

However, this function is not applicable to very large tables that do not fit comfortably into RAM, so I was wondering whether an SQL query can be designed to do the same job on very large tables. DuckDB (https://duckdb.org/) is a very efficient open-source analytical database that works nicely with R, so we try importing the above table into a DuckDB database as follows:

library(DBI)
library(duckdb)

dbFile <- "...path..to..database.../experimental_database.db"
con <- dbConnect(duckdb::duckdb(), dbFile)

# Import the customers table into the database. In the actual scenario, the
# very large bigger-than-RAM table would be imported into the database from a
# large CSV, or the SQL query would be run directly on chunked Parquet files,
# as mentioned at https://duckdb.org/docs/data/parquet (see the sketch after
# this block). For simplicity, here we just use the customers table to try to
# compose a functional SQL query.
dbWriteTable(con, "customers", customers)

# check if table is successfully imported
dbListTables(con)

# try reading the table
dbReadTable(con, "customers")
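
For the real bigger-than-RAM case, the import step would look different. A minimal sketch, assuming hypothetical file names customers_big.csv and chunks/*.parquet:

# create the table by streaming a large CSV through DuckDB's CSV reader
dbExecute(con, "CREATE TABLE customers AS
                SELECT * FROM read_csv_auto('customers_big.csv');")

# or skip the import entirely and query chunked Parquet files in place
dbGetQuery(con, "SELECT COUNT(*) FROM parquet_scan('chunks/*.parquet');")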

Now a functional SQL query needs to be composed that gives the same result as the dupKeyEx() function above. If that SQL query runs fine and gives valid results on bigger-than-RAM tables, the goal of this thread will be considered accomplished.
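
One candidate shape for such a query, offered only as a sketch (checked against this small sample, not against a big-data setup): group on the duplicate-key columns, keep the groups that contain more than one distinct ID, and join back to fetch the full rows.

qry <- "
SELECT DISTINCT c.NAME, c.DOB, c.ID, c.PIN
FROM customers AS c
JOIN (
  SELECT NAME, DOB
  FROM customers
  GROUP BY NAME, DOB
  HAVING COUNT(DISTINCT ID) > 1
) AS dup
  ON c.NAME = dup.NAME AND c.DOB = dup.DOB
ORDER BY c.NAME, c.DOB, c.ID;
"
dbGetQuery(con, qry)

A few caveats: dupKeyEx() also trims whitespace (TRIM(NAME) would be the SQL counterpart); the equality join on DOB drops NULL dates, so DuckDB's IS NOT DISTINCT FROM would be needed to mimic data.table's NA-matching join; and DISTINCT over all four columns keeps rows that differ only in PIN, whereas unique(..., by=cols) keeps just the first. Whether this stays out-of-core on truly bigger-than-RAM tables is exactly what remains to be confirmed.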

Any help will be appreciated.

San
  • I've been thinking about this for a few days, and ... I don't think SQL is a great language (by itself) for looking for changes programmatically in a table in a "meta" way, where it looks through each column for sameness, then looks at the other columns in those repeated rows for non-sameness. Compound this (significantly!) with the fact that you want sameness in two *or more* columns. I think it might be important to explain what you are ultimately trying to do; perhaps this method is not necessary for your desired end result. (As it stands, I think it's too bespoke for generality.) – r2evans Oct 18 '21 at 14:51
  • I've solved this for `csv` files using another open-source tool called `csvtk`. I raised the issue there and then solved it myself. Please visit https://github.com/shenwei356/csvtk/issues/169 to check the code. For repeated use, I've also used `AutoHotkey` to create a GUI for it. Still wondering if SQL solution would be convenient and practical for repeated use. – San Oct 19 '21 at 02:50
  • @r2evans This is an important operation in forensic accounting investigations (fraud analytics). It's explained in the links shown below. Just search for "duplicate key exclusion" on these sites: (1) https://ebrary.net/13412/business_finance/same-same-same_test (2) http://library.wbi.ac.id/repository/124.pdf (3) https://pqc.icai.org/assets/announcement_files/1550139967.pdf (4) https://bangaloreicai.org/images/icons/ITT/5.%20Computer%20Assisted%20Audit%20Technique%20(CAAT).pdf – San Oct 19 '21 at 03:21
  • That's interesting background, and a strong list of references (forgive me if I don't read through all of them :-). This does not change my opinion that SQL by itself may not be the best language for this; to do it will likely require either (a) hard-coding several iterative things in SQL, and/or (b) repeated queries for the same immediate purpose. – r2evans Oct 19 '21 at 11:25
  • I've also never liked SQL for analysing data. R is my first choice and where most of my work is done. But most R packages can only handle data that fits into RAM, and the R big-data packages usually fail to parse my CSV files. Recently, the volume of data I have to handle has increased so much that I'm exploring new ways. DuckDB did a group-by count on 80 GB of data on an 8 GB RAM machine very comfortably, whereas other methods were unsuccessful. And it can directly query chunked big-data Parquet files with SQL. So I'm trying SQL only because big-data methods in R are unreliable for now. – San Oct 19 '21 at 15:07

1 Answer


Here is a recipe for finding the table rows with duplicated values in some columns: https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/
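
The gist of that recipe, applied to the sample table (a sketch; note it only finds the duplicated keys, not yet the differing-ID condition):

dbGetQuery(con, "
  SELECT NAME, DOB, COUNT(*) AS n
  FROM customers
  GROUP BY NAME, DOB
  HAVING COUNT(*) > 1;
")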

If you have a large CSV and already know the column(s) you will be using for searching/grouping, you may split the CSV, sort the chunks, then merge the sorted chunks and save the result as Parquet.
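
As a rough sketch of that idea, DuckDB itself can do the sort and the Parquet write in one step (swapping the manual split-sort-merge for DuckDB's own out-of-core ORDER BY; big.csv and big_sorted.parquet are placeholder names):

dbExecute(con, "
  COPY (SELECT * FROM read_csv_auto('big.csv') ORDER BY NAME, DOB)
  TO 'big_sorted.parquet' (FORMAT PARQUET);
")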

darked89
  • The referred link gives an SQL query to extract rows containing duplicate values in specified columns, but my query has **an additional criterion**: to "extract rows with duplicate values in two or more fields but different values in another field". – San Nov 30 '21 at 08:06