4

How to remove the duplicated data in a dfs table?

I create a dfs table with some duplicated data with the following script:

login("admin", "123456")
if(existsDatabase("dfs://compoDB")){
    dropDatabase("dfs://compoDB")
}
ID = `a`a`a`a`a`a`a
date = 2017.08.07 2017.08.07 2017.08.08 2017.08.08 2017.08.09 2017.08.09 2017.08.09
x = 1 1 2 2 3 3 3
t = table(ID, date, x)

dbDate = database(, VALUE, 2017.08.07..2017.08.11)
dbID= database(, HASH, [SYMBOL, 8])
db = database("dfs://compoDB", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID)
pt.append!(t)

There are some duplicated data in this table. Now I want to remove the duplicated data from it. How to do it?

2 Answers2

1

Ideas:

  • step1: fetch the data of the distributed table to the memory table t.
  • step2: de-duplicate the delete operation on the memory table t.
  • step3: delete the partition involved in the dfs table to de-duplicate the data.
  • step4: rewrite the deduplicated memory table t back to the distributed table.

The code for the above scenario can be written like this:

def DeletDuplicateData(DBname, TBname, days){
    for(day in days){
        pt = loadTable(DBname, TBname)
        t = select * from pt where date=day
        delete from t where  isDuplicated([ID, date, x], LAST)=true
        try{dropPartition(database(DBname), days)} catch(ex){print ex}
        pt.append!(t)
    }
}
DBname = "dfs://compoDB"
TBname = "pt"
days = 2000.01.01..2021.06.30
ploop(DeletDuplicateData{DBname, TBname}, days)
loop(DeletDuplicateData{DBname, TBname}, days)

Note: You can select one of ploop and loop to execute.

This is to de-duplicate the data from 2001.01.01 to 2021.06.30. If the computer memory resources are relatively large, parallel execution is recommended. If the computer memory resources are relatively small and the daily data volume is relatively large, serial operation is recommended.

After executing the above code, query the dfs table:

select * from loadTable("dfs://compoDB", `pt)

Result:

ID date x
a 2017.08.07 1
a 2017.08.08 2
a 2017.08.09 3
Polly
  • 603
  • 3
  • 13
0

You can try to combine the operation of delete and map. Deduplication of a dfs table can be done with the following statement:

delete from pt where  isDuplicated([ID, date, x], LAST)=true map
damie
  • 412
  • 2
  • 7