Generate unique ID in Java, to label groups of related entries in a log

Question

There are several posts on SO on this topic. Each of them covers one specific approach, so I wanted to get a comparison in a single question.

Using new Date() as unique identifier

Generating a globally unique identifier in Java

I am trying to implement a feature that lets us identify certain events in the log file. Each of these events needs to be associated with a unique ID, and I am trying to come up with a strategy for generating it. The ID has to have two parts: some static information + some dynamic information. The logs can then be searched for that pattern when an event needs debugging. I see three options:

static info + Joda Date time("abc"+2014-01-30T12:36:12.703)
static info + Atomic Integer
static info + UUID

For the scope of this question, multiple JVMs is not a consideration. I need to generate unique IDs in an efficient manner on one JVM. Also, I will not be able to use a database dependent solution.

Which of the 3 above mentioned strategies works best ?

If not one from the above, any other strategy ?
Is the Joda time based strategy robust ? The JVM is single but there will be concurrent users so there can be concurrent events.
Whichever strategy I end up with, do I need to make my method thread-safe / synchronized?

"works best" : I need to generate unique IDs in an efficient manner on one JVM — souser, Feb 03 '14 at 20:12
Take a look at [UUID.randomUUID()](http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#randomUUID()). — Mr. Polywhirl, Feb 03 '14 at 20:18
Thanks, but how does that compare with the Joda time option? — souser, Feb 03 '14 at 20:20
That is why I left this as a comment, as it does not fully answer your question. Just a tip. — Mr. Polywhirl, Feb 03 '14 at 20:20
@Mr.Polywhirl UUID.randomUUID() is a synchronized method and does not perform well under high contention. Creating Joda objects for every identifier appears rather wasteful too. I'd use System.currentTimeMillis() or nanoTime() + atomic counter. — Ralf, Feb 03 '14 at 20:49
@souser Are you saying you want to tie together multiple entries in a log? The entries may be interleaved with other unrelated entries, so you want to be able to query for just the related entries – is that your question? — Basil Bourque, Feb 03 '14 at 23:33
@Ralf Do you have numbers to cite? In modern JVMs, a synchronized method is not expensive. It is hard to imagine that in the context of logging the OP would be generating enough UUIDs to make any impact on real-world performance. My test: A tight loop of a million calls to `java.util.UUID.randomUUID()` in Java 8 beta 127 from Netbeans 7.4 running inside a Parallels 9 virtual machine running Mountain Lion on a Mac mini (Intel i7) running Mavericks. Results: **2 milliseconds per UUID**. Granted that is without contention, but nevertheless I would say your concern is "premature optimization". — Basil Bourque, Feb 03 '14 at 23:55
@BasilBourque Thanks for your inputs. Not sure what "tying together" means. To simplify : One event, generate & log one id related to that event, look it up using the generated id. Hope that makes it clear — souser, Feb 04 '14 at 00:03
@souser If you have one single entry you need to find in your log, slap a UUID on that entry and you are done. But your question seems to be more than that. Do you mean something like this: Susan (or Thread 'A') is running code that makes several log entries, and Bob (or Thread 'B') is running that same code, and given that the log entries look confusingly alike, you want to be able to find the collection of Susan's entries from just that one run of code while ignoring Bob's entries? — Basil Bourque, Feb 04 '14 at 00:14
@BasilBourque ;) Susan and Bob, both are concurrently accessing the application. During the course of their work, events were raised and logged. I need to be able to distinctly identify events that are related to Susan/Bob . And thus I have some mechanism in place to do that, just need to make sure that the logging part uses distinct ids. UUID seems to fit the bill. — souser, Feb 04 '14 at 00:23
Regarding my comment above about testing the speed of generating UUIDs, see [this other answer of mine](http://stackoverflow.com/a/21540935/642706) for more information about that test, as well as an extended test of contention where 3 threads simultaneously generate a million UUIDs each. Conclusion: Contention makes no real-world impact on performance. — Basil Bourque, Feb 04 '14 at 00:56
@BasilBourque I will look at the link. Also, I would certainly value your inputs on my comment to DwB's answer — souser, Feb 04 '14 at 01:02
@BasilBourque Well, if that is the case then the whole question is "premature optimization". We also create correlation IDs for logging. That does not mean that every ID gets logged. We concurrently process >10'000 events per second and the randomUUID() was a bottleneck for us (2ms is ages, really). Check out this [article](http://mechanitis.blogspot.ch/2011/07/dissecting-disruptor-why-its-so-fast.html) and other LMAX/disruptor related articles to find numbers about the cost of locking, with and without contention. — Ralf, Feb 04 '14 at 07:08
@Ralf Okay, now I see your point. At a scale of 10,000 events per second, your concerns may be warranted. That's beyond the experience of me and my colleagues (in-house corporate departments or niche web sites). — Basil Bourque, Nov 26 '14 at 15:23

score 10 · Accepted Answer · edited May 23 '17 at 12:08

I have had the same need as you, distinguishing a thread of related entries interleaved with other unrelated entries in a log. I have tried all three of your suggested approaches. My experience was in 4D not Java, but similar.

Date-Time

In my case, I was using a date-time value resolved to whole seconds. That is simply too large a granularity. I easily had collisions where multiple events started within the same second. Damn those speedy computers!

In your case, the bundled java.util.Date and Joda-Time (highly recommended for other purposes) both resolve to milliseconds. A millisecond is a long time for a modern computer, so several events can easily share one. I don't recommend this.

In Java 8, the new java.time.* package (inspired by Joda-Time, defined by JSR 310) resolves to nanoseconds. This might seem to be a better identifier, but no. For one thing, your computer's physical time-keeping clock may not support such a fine resolution. Another is that computers keep getting faster. Lastly, a computer's clock can be reset; indeed it is reset often, as computer clocks drift quite a bit. Modern OSes correct their clocks by frequently checking with a time server, either locally or over the Internet.

Also, logs already have a timestamp, so we are not getting any extra benefit by using a date-time as our identifier. Indeed, having a second date-time in the log entry may actually cause confusion.
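
The collision risk described in this section is easy to observe. What follows is only a minimal sketch, not a benchmark: the class name `TimestampCollisions` and the sample size are my own illustrative choices. It stamps a burst of events with `System.currentTimeMillis()` and counts how many distinct values come out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; the name and sample size are illustrative only.
final class TimestampCollisions {

    // Takes `samples` millisecond timestamps back to back and returns
    // how many distinct values were observed.
    static int distinctTimestamps(int samples) {
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < samples; i++) {
            seen.add(System.currentTimeMillis());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int samples = 100_000;
        int distinct = distinctTimestamps(samples);
        // Every sample beyond `distinct` shared its timestamp with another one.
        System.out.println(samples + " events, " + distinct + " distinct timestamps");
    }
}
```

On typical hardware the number of distinct timestamps should be a tiny fraction of the number of events, which is why a timestamp alone cannot serve as the dynamic part of the ID.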

Serial Number

By "Atomic Integer", I assume you mean a serial number, a counter that is incremented for each new ID.

This seems overkill for your purpose.

You don't care about the sequence, it has no meaning for this purpose of grouping log entries. You don't really care if one group came nth number before or after another group.
Maintaining a sequence is a pain and a potential point of failure. I have always eventually run into administrative problems with maintaining a sequence.

So this approach adds risk without adding any special benefit.
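
For comparison, here is roughly what the counter approach from the question could look like. This is a minimal sketch under my own assumptions: the class name `SerialIds`, the `"abc"` prefix, and the `-` separator are all hypothetical. It does show that no `synchronized` is needed, since `AtomicLong.incrementAndGet()` is atomic, but also that the sequence lives only in memory.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; the class name, prefix, and separator are illustrative.
final class SerialIds {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Thread-safe without synchronized: incrementAndGet() is atomic.
    // The counter starts over after every JVM restart unless extra code
    // persists and restores the last value.
    static String next(String prefix) {
        return prefix + "-" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(next("abc"));
        System.out.println(next("abc"));
    }
}
```

Persisting the counter across restarts is exactly the kind of upkeep that makes a sequence more trouble than it is worth for tagging log entries.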

UUID

Bingo! Just what you need.

A UUID is easily generated, using either the bundled java.util.UUID class' ability to generate Version 3 or 4 UUIDs, or using a third-party library, or accessing the command-line's uuidgen tool.

For a very high volume, [Version 1] UUID (MAC + date-time + random number) would be best. For logging, a Version 4 UUID (entirely random) is absolutely acceptable.
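
To make the two bundled variants concrete, here is a short sketch using only `java.util.UUID`. The class name `UuidVersions` and the sample name `"order-service"` are illustrative assumptions. Note that a Version 3 UUID is derived from its input, so the same name always yields the same UUID; for tagging separate runs of code, the random Version 4 is the one to use.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical demo class; only the java.util.UUID calls are real API.
final class UuidVersions {

    // Version 4: generated from random bits, different on every call.
    static UUID random() {
        return UUID.randomUUID();
    }

    // Version 3: derived from the given name, so equal names give equal UUIDs.
    static UUID named(String name) {
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        UUID v4 = random();
        UUID v3 = named("order-service");
        System.out.println(v4 + " is version " + v4.version());
        System.out.println(v3 + " is version " + v3.version());
    }
}
```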

Having a collision is not a realistic concern. Especially for the limited number of values you would be generating for logs. I'm amazed by people who, failing to comprehend the numbers, say they would never replace a sequence with a UUID. Yet when pressed, every single programmer and sysadmin I know has experienced failures with at least one sequence.

No concerns about thread-safety. No concerns about contention (see my test results on another answer of mine).
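
If you want to check this on your own hardware, something along these lines would do. To be clear, this is not the code from my other answer, just a sketch in the same spirit; the class name `UuidContention` and the thread and iteration counts are arbitrary assumptions. Several threads call `UUID.randomUUID()` at the same time with no locking of their own.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; thread and iteration counts are arbitrary.
final class UuidContention {

    // Starts `threads` threads that each generate `perThread` random UUIDs
    // concurrently, then returns the number of distinct values produced.
    static int generateConcurrently(int threads, int perThread) {
        Set<UUID> all = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    all.add(UUID.randomUUID());
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return all.size();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int distinct = generateConcurrently(3, 100_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The elapsed time includes the cost of storing each UUID in the set.
        System.out.println(distinct + " distinct UUIDs in " + elapsedMs + " ms");
    }
}
```

Treat the printed time as a rough indication only; a tight loop like this is not a rigorous benchmark.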

Another benefit of a UUID is that its usual hexadecimal representation, such as:

6536ca53-bcad-4552-977f-16945fee13e2

…is easily recognizable. When recognized, the reader immediately knows that string is meant to be a unique identifier. So its presence in your log is self-documenting.

I've found UUIDs to be the Duct Tape of computing. I keep finding new uses for them.

So, at the start of the code in question, generate a UUID and then embed that into every one of the related log entries.
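
In code, that could look like the following. It is a minimal sketch that assumes `java.util.logging` for output; the class name `OrderHandler`, the `"abc"` static prefix, and the log messages are made up for illustration. The ID is built once per run and then repeated in every related entry.

```java
import java.util.UUID;
import java.util.logging.Logger;

// Hypothetical handler; names, prefix, and messages are illustrative only.
final class OrderHandler {
    private static final Logger LOG = Logger.getLogger(OrderHandler.class.getName());

    // Static part + dynamic part, as described in the question.
    static String newCorrelationId(String prefix) {
        return prefix + "-" + UUID.randomUUID();
    }

    static void handle(String user) {
        String id = newCorrelationId("abc"); // one ID per run of this code
        LOG.info("[" + id + "] start handling request for " + user);
        LOG.info("[" + id + "] validation passed");
        LOG.info("[" + id + "] done");
    }

    public static void main(String[] args) {
        handle("Susan");
        handle("Bob");
    }
}
```

Searching the log for one of the IDs then returns only the entries from that single run, even when Susan's and Bob's entries are interleaved.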

While the hex string representation of a UUID is hard to read and write, in practice you need only scan a few of the digits at the beginning or end. Or use copy-paste with search and filter features in our modern console tools.

A few factoids

A UUID is known in the Microsoft world as a GUID.
A UUID is not a string, but a 128-bit value. Bits, just bits in memory, "on"/"off" values. Some databases, such as Postgres, know how to handle and store UUID as such 128-bit values. If we wish to show those bits to humans, we could use a series of 128 digits of "1" & "0". But humans do not do well trying to read or write 128 digits of ones and zeros. So we use the hexadecimal representation. But even 32 hex digits is too much for humans, so we break the string into groups separated with hyphens as shown above, for a total of 36 characters.
The spec for a UUID is quite clear that a hexadecimal representation should be lowercase. The spec says that when creating a UUID from a string input, uppercase should be tolerated. But when generating a hex string, it should be lowercase. Many implementations of UUIDs ignore this requirement. I suggest sticking to the spec and converting your UUID hex strings to lowercase.
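
These points can be seen directly with `java.util.UUID`. A small sketch, where the class name `UuidFacts` and the helper `canonical` are hypothetical: the value is held as two 64-bit longs, `fromString` tolerates uppercase input, and `toString` produces the 36-character lowercase form.

```java
import java.util.UUID;

// Hypothetical demo class; only the java.util.UUID calls are real API.
final class UuidFacts {

    // Parses any-case hex input and returns the canonical lowercase text.
    static String canonical(String text) {
        return UUID.fromString(text).toString();
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("6536CA53-BCAD-4552-977F-16945FEE13E2");
        // The UUID itself is 128 bits, exposed here as two 64-bit halves.
        long high = id.getMostSignificantBits();
        long low = id.getLeastSignificantBits();
        System.out.println(Long.toHexString(high) + " " + Long.toHexString(low));
        // The text form is 32 hex digits plus 4 hyphens.
        System.out.println(id + " has " + id.toString().length() + " characters");
    }
}
```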

MDC – Mapped Diagnostic Context

I have not yet used MDC, but want to point it out…

Some logging frameworks are adding support for this idea of tagging related log entries. Such support is called Mapped Diagnostic Context (MDC). The MDC manages contextual information on a per thread basis.

A quick introductory article is "Log4j MDC (Mapped Diagnostic Context): What and Why".

The best logging façade, SLF4J, offers such an MDC feature. The best implementation of that façade, Logback, has a chapter documenting its MDC feature.
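
Since I have not used MDC myself, take the following only as a sketch of the idea, a per-thread map that is consulted whenever an entry is written. `MiniMdc` is a hypothetical stand-in built on `ThreadLocal`, not a class from any logging library, and the key `"eventId"` is my own choice. In SLF4J the corresponding calls are `MDC.put` and `MDC.remove`, with the framework doing the formatting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for a logging framework's MDC; not a library class.
final class MiniMdc {
    // Each thread sees its own map, so concurrent users never share a tag.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static String get(String key) { return CONTEXT.get().get(key); }
    static void remove(String key) { CONTEXT.get().remove(key); }

    // A real framework would add the tag itself, driven by its configuration.
    static void log(String message) {
        System.out.println("[" + get("eventId") + "] " + message);
    }

    public static void main(String[] args) {
        put("eventId", UUID.randomUUID().toString());
        try {
            log("start");
            log("done");
        } finally {
            remove("eventId"); // keep the tag from leaking into reused threads
        }
    }
}
```

The point of the design is that the code making the log calls does not have to pass the ID around; it is set once at the start of the work and cleared at the end.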

Thank you for your time and efforts Basil. I am sure this will help a lot of people. Most importantly, I have the satisfaction of learning something comprehensively. Paul and DWB certainly gave right answers but I had to choose this as right for its comprehensive coverage. Thanks to all who contributed. — souser, Feb 04 '14 at 02:10

score 7 · Answer 2 · answered Feb 03 '14 at 21:29


Computers are fast, so using time to attempt to create a unique value is going to fail.

Instead, use a UUID. From the Java SE 6 UUID API page: "[UUID is] A class that represents an immutable universally unique identifier (UUID)."

Here is some code:

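import java.util.UUID;

// Class name is illustrative; the field holds a random (Version 4) UUID as text.
public class Event {
    private String id = UUID.randomUUID().toString();
}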

answered Feb 03 '14 at 21:29

DwB


+1 for the explanation. At this point UUID seems a strong contender as it meets the 2 criteria of uniqueness and thread safety. Just to have the complete answer, how would you compare this with something like : static data+current time+ atomicinteger.incrementAndGet() http://stackoverflow.com/questions/4818699/practical-uses-for-atomicinteger – souser Feb 04 '14 at 00:06

I don't like to reinvent the wheel. UUID is already a solid solution (in my opinion) so I would not attempt to create another. – DwB Feb 04 '14 at 01:14
@souser Your comment's approach is basically reinventing UUID! Except that UUID generators go further, such as tracking if the system's clock gets set backward and if so then increment a little revolving number to further reduce possibility of duplicates. So, just use a UUID and save your programming efforts for creating *new* software rather than reinventing well-worn, tested, and debugged code. – Basil Bourque Nov 26 '14 at 15:31

score 0 · Answer 3 · answered Jun 12 '14 at 16:59

I have written a simple service that generates semi-unique, non-sequential 64-bit long numbers. It can be deployed on multiple machines for redundancy and scalability. It uses ZeroMQ for messaging. For more information on how it works, see the GitHub page: zUID