delete duplicate entities from BaseX with a primary key

Question

How do I remove duplicate entries?

In this case, Twitter uses snowflake's as a unique ID field in the JSON for the tweet, which is reflected below. Build an index of tweets, by ID, and delete each additional $tweet which has a duplicate? But how to not end up deleting every single tweet recursively...?

A simple FLWOR to print the ID for each tweet in the database:

for $tweets  in db:open("twitter")
return <tweet>{$tweets/json/id__str}</tweet>

results:

thufir@dur:~/flwor/twitter$ 
thufir@dur:~/flwor/twitter$ basex tweetID.xq 
<tweet>
  <id__str>1224165280068382720</id__str>
</tweet>
<tweet>
  <id__str>1224160851797643264</id__str>
</tweet>
<tweet>
  <id__str>1224134565280862208</id__str>
</tweet>
...
<tweet>
  <id__str>1224016596634603520</id__str>
</tweet>
<tweet>
  <id__str>1224001430417297410</id__str>
</tweet>
<tweet>
  <id__str>1223987662094249991</id__str>
</tweet>thufir@dur:~/flwor/twitter$ 
thufir@dur:~/flwor/twitter$

The duplicates here are intentional, but looking for some way to "clean up" the database.

Just looking for a general outline or approach. My thinking is to pipe output from one xquery into another, but am stuck at how to build an index. Presumably this is a functionality built into the database itself, just need to leverage the correct module (probably).

--

this looks to return a different result, at least:

distinct-values(
    for $tweets in db:open("twitter") 
    return ($tweets/json/id__str))

although I'm not quite sure it's the set of all id__str values.

line-o · Answer 1 · 2020-11-30T23:03:38.630

2

You can use group by in a FLOWR to get duplicates. It should also be faster than distinct().

for $tweets  in db:open("twitter")
let $id := $tweets/json/id__str
group by $id
return
  if (count($tweets) > 1)
  then (for-each(tail($tweets), function ($tweet) { (: remove $tweet from DB :) } ) 
  else () (: nothing to do :)

edited Nov 30 '20 at 23:03

answered Feb 04 '20 at 10:05

line-o

1,885
3
16
33

1

what is `function ($t)`? – Thufir Nov 27 '20 at 03:56
`$t` was short for `$tweet`. I made the code example a little more clear. – line-o Nov 30 '20 at 23:04

score 1 · Accepted Answer · answered Nov 11 '20 at 07:58

I had the same problem and tested two approaches that were already discussed here. Both approaches can be used to delete duplicates but there are performance differences.

The distinct-values approach:

(: Open database :)
let $db := db:open('db-name')

(: Get all distinct IDs :)
let $ids := distinct-values($db/record/id)

for $id in $ids
  (: Get all records with the same ID :)
  let $recsWithSameId := data($db/record/id)=$id

  (: Return only duplicate records :)
  return if (count($recsWithSameId)>1) then
    $recsWithSameId

    (: Instead of returning the duplicate records you can now delete all records except the one you want to keep. Then you removed the duplicates. :)

The group by approach:

for $recs in db:open('db-name')/record
  let $id := $recs/id
  group by $id
  return
    if (count($recs) > 1) then      
      $recs
      (: Instead of returning the duplicate records you can now delete all records except the one you want to keep. Then you removed the duplicates. :)

The second approach (group by) is much faster than the first one (distinct-values).

delete duplicate entities from BaseX with a primary key

2 Answers2