How do I remove duplicate entries?
In this case, Twitter uses snowflake's as a unique ID field in the JSON
for the tweet, which is reflected below. Build an index of tweets, by ID, and delete each additional $tweet
which has a duplicate? But how to not end up deleting every single tweet recursively...?
A simple FLWOR
to print the ID for each tweet in the database:
for $tweets in db:open("twitter")
return <tweet>{$tweets/json/id__str}</tweet>
results:
thufir@dur:~/flwor/twitter$
thufir@dur:~/flwor/twitter$ basex tweetID.xq
<tweet>
<id__str>1224165280068382720</id__str>
</tweet>
<tweet>
<id__str>1224160851797643264</id__str>
</tweet>
<tweet>
<id__str>1224134565280862208</id__str>
</tweet>
...
<tweet>
<id__str>1224016596634603520</id__str>
</tweet>
<tweet>
<id__str>1224001430417297410</id__str>
</tweet>
<tweet>
<id__str>1223987662094249991</id__str>
</tweet>thufir@dur:~/flwor/twitter$
thufir@dur:~/flwor/twitter$
The duplicates here are intentional, but looking for some way to "clean up" the database.
Just looking for a general outline or approach. My thinking is to pipe output from one xquery
into another, but am stuck at how to build an index. Presumably this is a functionality built into the database itself, just need to leverage the correct module (probably).
--
this looks to return a different result, at least:
distinct-values(
for $tweets in db:open("twitter")
return ($tweets/json/id__str))
although I'm not quite sure it's the set of all id__str
values.