
Is there any standard / boilerplate way to convert SQL uniqueidentifiers into the same HI/LO blocks as protobuf-net's BCL.Guids?

UPDATE:

So, given a Guid, "4D1CE8BE-C36B-4FFA-A4C8-9056619E9967", protobuf-net will serialize it as {"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}, which is how it subsequently gets stored in Hadoop. So SQL Server has a column with a Guid id, and Hadoop ends up with separate id.lo and id.hi values. What I need is a query to look up a record in Hadoop given the id from SQL Server. The C# code below gives me the translated values I'm after, but it requires me to run an intermediate client app simply to split the Guid. What I want is to be able to get id.lo and id.hi directly from a SQL query, so I can drop them straight into a filter in a subsequent Hadoop Map-Reduce job:

static void TestGuidSplit() {
    // Protobuf.Net serializes the following Guid as:
    // {"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}
    Guid testGuid = new Guid("4D1CE8BE-C36B-4FFA-A4C8-9056619E9967");
    Tuple<long, long> loHi = LoHi(testGuid);
    Console.WriteLine("lo: {0}, Hi:{1}.", loHi.Item1, loHi.Item2);
    Console.ReadLine();
}

static Tuple<long, long> LoHi(Guid someGuid) {
    byte[] bytes = someGuid.ToByteArray();
    long[] longs = new long[2];
    longs[0] = BitConverter.ToInt64(bytes, 0); // [0] = 5763133538796628158 = lo
    longs[1] = BitConverter.ToInt64(bytes, 8); // [1] = 7465171998244653220 = hi
    return new Tuple<long, long>(longs[0], longs[1]);
}

ORIGINAL QUESTION:

I have a ton of data in SQL Server that I need to join to data that was exported to Hadoop via protobuf-net. We have keys in common, but they are all uniqueidentifiers. With homogeneously-formatted keys, we could just extract a flat file from SQL, create a tabledef over that flat file in Hive, and then use Hive to run a joined query over that and the hadoop tables. With the disparate uuid formats though, this doesn't seem possible.

Do we need an intermediate process to harmonize the uuid formats? I'm hoping there's a way around this, i.e. to do it more like a simple ETL process. Ultimately, we just want to get at a set of hadoop data where some_id in (list of sql ids). The hadoop data is of unmanageable size if extracted without filtering down to the relevant ids from sql.

The simplest example I can think of to describe what I'm trying to do would be to imagine that I had two tables in SQL server, 'a' and 'b', where 'a' contained several uuid fields, and 'b' was a copy of 'a' except that the uuids were now 64-bit integers, HI/LO, from protobuf-net's bcl.guid. Given that scenario, I want to select * from 'b' where someid in (select someid from 'a' where interesting = true). What I'm missing is a function to get the HI and/or LO of someid from 'a', to supply to the in clause for my query from 'b'.
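To make that perfect-world version concrete, the query I'm imagining would look something like the sketch below, where dbo.GuidLo and dbo.GuidHi are purely hypothetical functions (they don't exist; deriving them is exactly the missing piece), and the someid_lo / someid_hi column names in 'b' are just illustrative:

-- hypothetical sketch only: dbo.GuidLo / dbo.GuidHi and the column names are placeholders
select *
from b
where b.someid_lo in (select dbo.GuidLo(a.someid) from a where a.interesting = 1)
  and b.someid_hi in (select dbo.GuidHi(a.someid) from a where a.interesting = 1)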

Paul Smith
  • I suspect some kind of example of the problem you are trying to solve here would help. Also, I'm not quite sure of what role protobuf-net is playing here... – Marc Gravell Jun 30 '11 at 06:36
  • Hi Marc. Protobuf-net serializes the data on its way to Hadoop, where it's stored natively in Protobuf format. The protobuf message is similar to an IIS log entry with a few guids added (which are foreign keys to data we have in SQL). The guids are serialized as bcl.guid. The goal now is to extract a subset of those log entries based on ID. If everything were in SQL server, we would just be looking at something like `SELECT * from weblogs where fooid in (select id from foo where interesting = 1)`. Does that make any more sense? – Paul Smith Jun 30 '11 at 13:33
  • @Paul I'm not hugely familiar with hadoop. However, if your field is field 1, it'll always be written first, so you should be OK if you have access to a binary-safe "starts-with" operator? – Marc Gravell Jun 30 '11 at 15:33
  • @Marc I appended a (hopefully) more straightforward example to the question. I don't know that this can be done, but I tried to describe my perfect-world scenario, taking hadoop out of the equation. – Paul Smith Jun 30 '11 at 18:21
  • @Paul if you just want something that takes a Guid and returns two numbers, I can do that - is that what you need? I'm happy to help, I just want to be very clear on "with what?" before I dive in... – Marc Gravell Jun 30 '11 at 18:32
  • @Marc, that would be awesome, *especially* if it could be done in T-SQL. If that's not possible though, I could do an extract using C#. – Paul Smith Jul 01 '11 at 20:33
  • I probably *can*, although TSQL makes it harder.... but wouldn't it be easier to store the guids in a column? – Marc Gravell Jul 01 '11 at 20:57
  • @Marc, I'm not clear on where you mean re: storing the guids in a column. In SQL server, or the extract? – Paul Smith Jul 03 '11 at 21:59
  • with the edit... can you be clearer how you are getting `{"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}` ? because that isn't something that protobuf-net will **ever** produce (it isn't JSON, and members don't have names) – Marc Gravell Jun 13 '12 at 06:15
  • Hmm, well that's honestly baffling. Maybe that's just how hive reads back the hi/lo struct out of hadoop, but we're simply using the default semantics of protobuf-net, the bcl.proto and protogen to serialize the thing. From there we send it to a Java IPC process that just looks at the message name & then writes the bytes into hdfs. At any rate, the encapsulation isn't really an issue for me, I can unwrap that easily enough. I just long for a way to derive the hi/lo values from the uniqueidentifier stored in SQL server, using SQL (i.e. with no intermediate client). :) – Paul Smith Jun 13 '12 at 19:24
  • 1
    added; let me know how you get on – Marc Gravell Jun 13 '12 at 20:17
  • Note that the C# code above does not work 100% correctly for every GUID. A working C# solution can be found here: https://stackoverflow.com/a/71510915/1099519 – DominikAmon Mar 17 '22 at 10:41

1 Answer


Conversion in SQL from a SQL uniqueidentifier column to the same hi / lo values generated by protobuf-net:

declare @guid uniqueidentifier = convert(uniqueidentifier, '4D1CE8BE-C36B-4FFA-A4C8-9056619E9967')
select @guid as 'guid' -- writes: 4D1CE8BE-C36B-4FFA-A4C8-9056619E9967, to prove it parsed correctly
declare @blob binary(16) = CONVERT(binary(16), @guid)
select CAST(SUBSTRING(@blob, 8, 1) + SUBSTRING(@blob, 7, 1) + SUBSTRING(@blob, 6, 1) + SUBSTRING(@blob, 5, 1) +
       SUBSTRING(@blob, 4, 1) + SUBSTRING(@blob, 3, 1) + SUBSTRING(@blob, 2, 1) + SUBSTRING(@blob, 1, 1) as bigint) as 'lo',
       CAST(SUBSTRING(@blob, 16, 1) + SUBSTRING(@blob, 15, 1) + SUBSTRING(@blob, 14, 1) + SUBSTRING(@blob, 13, 1) +
       SUBSTRING(@blob, 12, 1) + SUBSTRING(@blob, 11, 1) + SUBSTRING(@blob, 10, 1) + SUBSTRING(@blob, 9, 1) as bigint) as 'hi'
       -- writes: 5763133538796628158, 7465171998244653220

Note you can probably wrap this as a UDF somewhere...
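For example, a minimal sketch of such a wrapper as an inline table-valued function (the name dbo.GuidLoHi is just a placeholder, and this is untested):

-- sketch only: wraps the conversion above; function name is a placeholder
create function dbo.GuidLoHi (@guid uniqueidentifier)
returns table
as
return
(
    select
        cast(substring(g.blob, 8, 1) + substring(g.blob, 7, 1) + substring(g.blob, 6, 1) + substring(g.blob, 5, 1) +
             substring(g.blob, 4, 1) + substring(g.blob, 3, 1) + substring(g.blob, 2, 1) + substring(g.blob, 1, 1) as bigint) as lo,
        cast(substring(g.blob, 16, 1) + substring(g.blob, 15, 1) + substring(g.blob, 14, 1) + substring(g.blob, 13, 1) +
             substring(g.blob, 12, 1) + substring(g.blob, 11, 1) + substring(g.blob, 10, 1) + substring(g.blob, 9, 1) as bigint) as hi
    from (select convert(binary(16), @guid) as blob) as g
)

-- usage (table/column names are illustrative):
-- select s.lo, s.hi from foo cross apply dbo.GuidLoHi(foo.id) as s where foo.interesting = 1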

Marc Gravell