2

I am working with Azure Data Lake Analytics for the first time and I am unsure how to merge 2 datasets like I would with pandas in python.

I am merging two datasets that have different timestamps but I need to line them up if they are within a specific timespan. This is straight forward in python.

Example python code: pandas.merge_asof(trades, quotes, on='time', by='ticker', tolerance=pd.Timedelta('2ms'))

Full example can be found here: click here

It seems like I can use python in my script however, I am not sure how I would get two datasets into the python function to do this merge.

1 Answers1

1

There's several ways of doing this. Using Python you can use the Python reducer to go from U-SQL to Python. To run Python locally in ADL you will need to do some additional steps.

However I would recommend doing this in U-SQL. A quick way to do that would be to create a rounded DateTime column and then do an inner join on that:

@trades =
    SELECT *,
           // round up to nearest 2ms:
           new DateTime((time.Ticks + TimeSpan.FromMilliseconds(2).Ticks - 1) / TimeSpan.FromMilliseconds(2).Ticks * TimeSpan.FromMilliseconds(2).Ticks, time.Kind) AS RoundedTimeA
    FROM @trades;

@quotes =
    SELECT *,
           // round up to nearest 2ms:
           new DateTime((time.Ticks + TimeSpan.FromMilliseconds(2).Ticks - 1) / TimeSpan.FromMilliseconds(2).Ticks * TimeSpan.FromMilliseconds(2).Ticks, time.Kind) AS RoundedTimeB
    FROM @quotes;

@result = 
    SELECT *
    FROM @trades AS A INNER JOIN @quotes AS B ON
    A.RoundedTimeA == B.RoundedTimeB;
theadriangreen
  • 2,218
  • 1
  • 14
  • 14