Is there any standard / boilerplate way to convert SQL uniqueidentifiers into the same HI/LO blocks as protobuf-net's BCL.Guids?
UPDATE:
So, given the Guid "4D1CE8BE-C36B-4FFA-A4C8-9056619E9967", protobuf-net will serialize it as {"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}, which is how it subsequently gets stored in Hadoop. So SQL Server has a uniqueidentifier id column, while Hadoop ends up with separate id.lo and id.hi values. What I need is a query to look up a record in Hadoop given the id from SQL Server. The C# code below gives me the translated values I'm after, but it requires me to run an intermediate client app simply to split the Guid. What I want is to get id.lo and id.hi directly from a SQL query, so I can drop them straight into a filter in a subsequent Hadoop map-reduce job:
static void TestGuidSplit() {
    // protobuf-net serializes the following Guid as:
    // {"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}
    Guid testGuid = new Guid("4D1CE8BE-C36B-4FFA-A4C8-9056619E9967");
    Tuple<long, long> loHi = LoHi(testGuid);
    Console.WriteLine("lo: {0}, hi: {1}.", loHi.Item1, loHi.Item2);
    Console.ReadLine();
}

static Tuple<long, long> LoHi(Guid someGuid) {
    byte[] bytes = someGuid.ToByteArray();
    long lo = BitConverter.ToInt64(bytes, 0); // 5763133538796628158
    long hi = BitConverter.ToInt64(bytes, 8); // 7465171998244653220
    return new Tuple<long, long>(lo, hi);
}
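For reference, the same split can be reproduced outside of .NET. Here is a minimal Python sketch (the helper name is mine) that mirrors Guid.ToByteArray() via uuid.UUID.bytes_le, which uses the same mixed-endian layout as .NET: the first three Guid fields little-endian, the final eight bytes in order:

```python
import struct
import uuid

def lo_hi(guid_str):
    """Split a Guid into protobuf-net's (lo, hi) signed 64-bit pair."""
    # bytes_le matches .NET Guid.ToByteArray(): time_low, time_mid and
    # time_hi_version little-endian, the remaining eight bytes as-is.
    b = uuid.UUID(guid_str).bytes_le
    # Two little-endian signed 64-bit integers: bytes 0-7 = lo, 8-15 = hi.
    return struct.unpack('<qq', b)

lo, hi = lo_hi("4D1CE8BE-C36B-4FFA-A4C8-9056619E9967")
print(lo, hi)  # 5763133538796628158 7465171998244653220
```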
ORIGINAL QUESTION:
I have a ton of data in SQL Server that I need to join to data that was exported to Hadoop via protobuf-net. We have keys in common, but they are all uniqueidentifiers. With homogeneously formatted keys, we could just extract a flat file from SQL, create a table definition over that flat file in Hive, and then use Hive to run a joined query across it and the Hadoop tables. With the disparate uuid formats, though, this doesn't seem possible.
Do we need an intermediate process to harmonize the uuid formats? I'm hoping there's a way around this, i.e. a way to do it more like a simple ETL process. Ultimately, we just want to get at the set of Hadoop data where some_id in (list of SQL ids). The Hadoop data is of unmanageable size if extracted without first filtering down to the relevant ids from SQL.
The simplest example I can think of to describe what I'm trying to do: imagine I had two tables in SQL Server, a and b, where a contained several uuid fields, and b was a copy of a except that the uuids were now pairs of 64-bit integers, HI/LO, as in protobuf-net's Bcl.Guid. Given that scenario, I want to select * from b where someid in (select someid from a where interesting = true). What I'm missing is a function to get the HI and/or LO of someid from a, to supply to the in clause for my query against b.
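To make the goal concrete, here is a Python sketch of the filter I'm trying to end up with. The helper and the predicate shape are purely illustrative (id.lo/id.hi are the Hadoop-side column names from the UPDATE above, not any particular Hive dialect); today I'd have to generate this client-side:

```python
import struct
import uuid

def lo_hi(guid_str):
    # Same mixed-endian layout as .NET Guid.ToByteArray().
    return struct.unpack('<qq', uuid.UUID(guid_str).bytes_le)

def hadoop_filter(guid_strs):
    """Build an illustrative (id.lo, id.hi) IN (...) predicate from SQL-side Guids."""
    pairs = ", ".join("({0}, {1})".format(lo, hi)
                      for lo, hi in map(lo_hi, guid_strs))
    return "(id.lo, id.hi) IN ({0})".format(pairs)

print(hadoop_filter(["4D1CE8BE-C36B-4FFA-A4C8-9056619E9967"]))
# (id.lo, id.hi) IN ((5763133538796628158, 7465171998244653220))
```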