3

I am ingesting data from CSV files using the MarkLogic Java API, and I have to maintain a primary key for each document.

Does MarkLogic provide a unique auto-generated ID during insert?

If MarkLogic does not provide one, the only thing I can think of is a randomly generated hex string. The problem is that with a large number of records in the CSV, the random value might sometimes repeat.

Please suggest how to proceed with this use case.

Cœur
RCS

2 Answers

3

The advised approach is to use randomly generated ID values of sufficient length that a collision is practically impossible for your data set size. Because you're human you're still going to be tempted to check for collisions, but the math says that's simply wasteful. With a 64-bit random value you reach 50/50 odds of a collision only after roughly 5 billion values. Too risky? Use a 128-bit random value, where the odds reach 50/50 only after roughly 2 × 10^19 (about 22 quintillion) values. See "Probability of 64-bit hash code collisions".
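To make the numbers concrete, here is a minimal Java sketch, not anything MarkLogic-specific. The class and method names (`RandomIds`, `randomHexId`, `collisionProbability`) are made up for illustration. It generates a random hex ID of a chosen bit length with `java.security.SecureRandom` and estimates the collision odds with the standard birthday approximation p ≈ 1 − exp(−n(n−1) / 2^(bits+1)).

```java
import java.security.SecureRandom;

public class RandomIds {
    private static final SecureRandom RNG = new SecureRandom();
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Returns a random ID of the given size (bits must be a multiple of 8) as lowercase hex. */
    public static String randomHexId(int bits) {
        byte[] bytes = new byte[bits / 8];
        RNG.nextBytes(bytes);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Two hex digits per byte: high nibble, then low nibble.
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    /** Birthday approximation of the chance that n random IDs of this size contain a duplicate. */
    public static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        // -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    public static void main(String[] args) {
        System.out.println("128-bit id: " + randomHexId(128));
        System.out.println("p(collision), 1e6 ids, 64 bits: " + collisionProbability(1e6, 64));
        System.out.println("p(collision), 5.1e9 ids, 64 bits: " + collisionProbability(5.1e9, 64));
    }
}
```

If you would rather not roll your own, `java.util.UUID.randomUUID()` gives you 122 random bits in a standard format, which is plenty for a CSV load.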

Community
hunterhacker
  • Thanks hunterhacker for reply. – RCS Jun 20 '16 at 15:20
  • @hunterhacker This is not true in all cases, right? Being a database, it should be able to provide a way to generate sequence or random numbers. MongoDB, for example, generates a unique _id object for every insert. Isn't there something similar in MarkLogic? – DMA Jun 20 '16 at 15:58
  • Is there a way to reliably generate a sequence? Absolutely! Do you want to do it for unique IDs? No. Because it adds overhead for no benefit, and there's WAY better odds your smart code has a bug than a large random value collides with another. MongoDB doesn't do in-database checking, btw. The Mongo _id is built *client-side* and "is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter" per http://stackoverflow.com/questions/5817795/how-are-mongodbs-objectids-generated. Reset your clock and good luck... – hunterhacker Jun 21 '16 at 16:51
2

xdmp:random() is a 64-bit pseudo-random number generator (PRNG) with the properties of such, using a FIPS-compliant implementation when available. It is the same generator used internally for generating document and fragment IDs. So in practice you cannot do better with respect to efficient generation of unique IDs. And yes, this is something that most people find difficult to accept at first (myself included).
Now, that is not necessarily the same as guaranteeing that, in some specific context, your use of it generates unique URIs (a URI being ML's version of a GUID or database-wide 'primary key'). To do that you have to guarantee that the only source of URIs is the ones you generate, and that you make full use of all 64 bits. If you want to prove to yourself that an ID is absolutely unique no matter what is going on, then you need a transactional atomic counter of some sort. Those are easily made (a read-update-write-commit of a single shared document), but that is horrendously slow at scale.

Another alternative, if the data is batch uploaded from CSV, is to use the offset (row or line number) of the record as part of the URI, together with something unique about each file, like its filename.
Often the CSV data itself has a column or combination of columns that represents a primary key for that dataset. That can be used as well.
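
Both ideas can be sketched in plain Java, independent of any MarkLogic API. The class name `CsvUris`, the method names, and the `/csv/...` and `.json` URI layout are hypothetical choices for illustration. The key-based variant assumes the key values are URI-safe and do not contain the `-` separator.

```java
import java.util.List;

public class CsvUris {
    /** Deterministic URI from the source file name and the record's row number. */
    public static String uriFromRow(String fileName, long rowNumber) {
        return "/csv/" + fileName + "/" + rowNumber + ".json";
    }

    /** Deterministic URI from the column values that form the dataset's primary key. */
    public static String uriFromKey(String dataset, List<String> keyValues) {
        // Assumes key values are URI-safe and never contain '-'; otherwise encode them first.
        return "/" + dataset + "/" + String.join("-", keyValues) + ".json";
    }

    public static void main(String[] args) {
        System.out.println(uriFromRow("customers.csv", 42));               // /csv/customers.csv/42.json
        System.out.println(uriFromKey("orders", List.of("2016", "A17")));  // /orders/2016-A17.json
    }
}
```

Because these URIs are deterministic, reloading the same file overwrites the same documents instead of creating duplicates, which random IDs cannot give you.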

DALDEI