
I need the SQL equivalent of an AUTO_INCREMENT id in Hadoop.

When my reduce task identifies a new item, each such item needs a unique ID assigned.

  • How can I share an atomic counter across the cluster? The Reporter counters seem to be increment-only; there's no getAndIncrement feature that I can see.

  • How can I set that counter before the map/reduce phase of the job starts?

David Parks
    possible duplicate of [Distributed sequence number generation?](http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation) – Praveen Sripati Oct 27 '12 at 05:23

1 Answer


For distributed ID generation you can either just generate UUIDs or use Apache ZooKeeper, which provides distributed coordination for Hadoop clusters. Disclaimer: I have never used ZooKeeper, so I don't know whether you can really (even theoretically) get a globally contiguous set of IDs, which is what the question seems to be asking.
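Since the answer doesn't commit to a specific ZooKeeper recipe, here is only a rough illustration of what a cluster-wide counter might look like. It assumes the Apache Curator recipes library (`DistributedAtomicLong`); the connect string, znode path, and retry settings are placeholders, not values from the question.

```java
// Sketch only: a shared counter backed by ZooKeeper via Apache Curator.
// Connect string and counter path are illustrative assumptions.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.atomic.AtomicValue;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkIdAllocator implements AutoCloseable {
    private final CuratorFramework client;
    private final DistributedAtomicLong counter;

    public ZkIdAllocator(String zkConnectString, String counterPath) {
        this.client = CuratorFrameworkFactory.newClient(
                zkConnectString, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
        this.counter = new DistributedAtomicLong(
                client, counterPath, new ExponentialBackoffRetry(1000, 3));
    }

    /** Atomically reserve the next id; Curator handles the retry logic. */
    public long nextId() throws Exception {
        AtomicValue<Long> result = counter.increment();
        if (!result.succeeded()) {
            throw new IllegalStateException("ZooKeeper increment failed");
        }
        return result.postValue();
    }

    @Override
    public void close() {
        client.close();
    }
}
```

A reducer could open one allocator in `setup()` and call `nextId()` per new item, but note that every increment is a round trip to ZooKeeper, so this is only practical when new items are relatively rare.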

Generating UUIDs does have a cost, though: each one takes some time to produce.
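For comparison, a minimal sketch of the UUID route inside a reducer. Only `java.util.UUID` and the standard Hadoop `Reducer` API are used; the `NewItemReducer` name and the `Text` key/value types are illustrative.

```java
// Sketch: tag each new item with a locally generated UUID.
// No cross-node coordination is needed, but the ids are random, not sequential.
import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NewItemReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Type-4 UUIDs come from a SecureRandom source, which is where the
        // generation cost mentioned above comes from.
        String id = UUID.randomUUID().toString();
        for (Text value : values) {
            context.write(new Text(id), value);
        }
    }
}
```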

For good general information on distributed ID generation, see this Stack Overflow question.

– Ray Toal
  • Yeah, they have to be incrementing IDs in a specific range, not just unique. – David Parks Oct 27 '12 at 03:40
  • I thought that was what you wanted. Check out ZooKeeper then. While I've done a lot with Hadoop, I've always generated UUIDs, because the very thought of building in a global atomic integer just seemed weird. On a 1,000-node cluster you want 999 machines to wait? Seriously, I expect that the ZooKeeper people figured this all out, however intractable it seems. If you can't get what you want, generate UUIDs in the map phase, then create a contiguous set in the reduce phase or in a separate sequential process _after_ your MR jobs complete (a sketch of that two-pass approach follows below). – Ray Toal Oct 27 '12 at 04:48
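A rough sketch of that two-pass idea: after the first job has tagged records with UUIDs, a second job forced to a single reducer can hand out contiguous IDs. The `SequentialIdReducer` name and the `Text` key/value types are made up for illustration.

```java
// Sketch: second-pass job that replaces temporary UUID keys with contiguous ids.
// Correct only when the job runs with exactly one reducer.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SequentialIdReducer extends Reducer<Text, Text, Text, Text> {
    private long nextId = 0;  // safe only because a single reducer sees all keys

    @Override
    protected void reduce(Text uuid, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String id = Long.toString(nextId++);
        for (Text value : values) {
            context.write(new Text(id), value);
        }
    }
}
```

The driver would need to call `job.setNumReduceTasks(1)` for the IDs to stay contiguous, which serializes the final pass; that is the trade-off the comment above alludes to.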