Would it be possible to automatically split a table into several files based on column values if I don't know how many different key values the table contains? Is it possible to put the key value into the filename?
3 Answers
This is our top ask (and it has been asked on Stack Overflow before, too :). We are currently working on it and hope to have it available by summer.
Until then you have to write a script generator. I tend to use U-SQL to generate the script, but you could do it with PowerShell, T4, etc.
Here is an example:
Let's assume you want to write files for the column name in the following table/rowset @x:
name | value1 | value2
-----+--------+-------
A    | 10     | 20
A    | 11     | 21
B    | 10     | 30
B    | 100    | 200
You would write a script to generate the script like the following:
@x = SELECT * FROM (VALUES( "A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);
// Generate the script to do partitioned output based on name column:
@stmts =
SELECT "OUTPUT (SELECT value1, value2 FROM @x WHERE name == \""+name+"\") TO \"/output/"+name+".csv\" USING Outputters.Csv();" AS output
FROM (SELECT DISTINCT name FROM @x) AS x;
OUTPUT @stmts TO "/output/genscript.usql"
USING Outputters.Text(delimiter:' ', quoting:false);
Then you take genscript.usql, prepend the calculation of @x, and submit it to get the data partitioned into the two files.
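For the sample rowset above, the generated genscript.usql (after prepending the calculation of @x by hand) would look roughly like this — a sketch, since the exact text depends on the outputter settings:

```usql
// Prepended by hand: the calculation of @x from the original script.
@x = SELECT * FROM (VALUES( "A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

// Generated: one OUTPUT statement per distinct value of the name column.
OUTPUT (SELECT value1, value2 FROM @x WHERE name == "A") TO "/output/A.csv" USING Outputters.Csv();
OUTPUT (SELECT value1, value2 FROM @x WHERE name == "B") TO "/output/B.csv" USING Outputters.Csv();
```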
- Please provide the script for the "/output/genscript.usql" file. – Arron Jul 24 '17 at 06:48
- Hi Arron, I am not sure what you would like me to provide. The script above generates the genscript.usql content. – Michael Rys Aug 17 '17 at 23:15
- How does the @x dataset get passed to the external genscript.usql file? What is the best way to execute it? Data Factory? – coltonfranco Sep 22 '17 at 21:03
- You would have to add the @x generation to the generated script as well: basically union it into @stmts, and add another column that is null for the OUTPUT statements and "@x =" for the assignment, which you can then use in the OUTPUT's ORDER BY clause to put the statements in the right order in the generated script. You can orchestrate this with ADF or PowerShell, for example. – Michael Rys Oct 05 '17 at 00:48
- ...and after that sterilize the affected area and allow for 2-4 weeks of healing – Alex Gordon Sep 06 '19 at 03:25
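The single-script variant described in the comment above can be sketched as follows. This is an untested sketch: the @x literal is repeated by hand inside the generated assignment string, and it relies on Outputters.Text joining the prefix and statement columns with a space, and on null sorting after "@x =" when ordered descending:

```usql
@x = SELECT * FROM (VALUES( "A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

@stmts =
    // The assignment row: prefix "@x =", statement body as the second column.
    SELECT "@x =" AS prefix,
           "SELECT * FROM (VALUES( \"A\", 10, 20), (\"A\", 11, 21), (\"B\", 10, 30), (\"B\", 100, 200)) AS T(name, value1, value2);" AS stmt
    FROM (VALUES(1)) AS T(dummy)
    UNION ALL
    // One OUTPUT statement per distinct name; prefix is null for these rows.
    SELECT (string)null AS prefix,
           "OUTPUT (SELECT value1, value2 FROM @x WHERE name == \""+name+"\") TO \"/output/"+name+".csv\" USING Outputters.Csv();" AS stmt
    FROM (SELECT DISTINCT name FROM @x) AS x;

// Descending order puts the "@x =" row before the null-prefixed OUTPUT rows;
// with delimiter ' ' and quoting:false each row becomes one valid U-SQL statement.
OUTPUT @stmts
TO "/output/genscript.usql"
ORDER BY prefix DESC
USING Outputters.Text(delimiter:' ', quoting:false);
```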
There's a new feature in public preview:
SET @@FeaturePreviews = "DataPartitionedOutput:on";
You can add it at the beginning of the script, and the output data can be partitioned by the key you choose:
SET @@FeaturePreviews = "DataPartitionedOutput:on";
@result =
    SELECT date.Hour AS hour,
           COUNT( * ) AS total,
           ANY_VALUE(message) AS sample_message,
           ANY_VALUE(stack_trace) AS sample_trace
    FROM @data
    WHERE date.Year == 2018
          AND date.Month == 7
    GROUP BY date.Hour
    HAVING COUNT( * ) > 80;

OUTPUT @result
TO "/output/parquetfiles-{hour}.csv"
ORDER BY hour DESC,
         total DESC
USING Outputters.Csv(outputHeader : true);
Another example can be found in the article "Process more files than ever and use Parquet with Azure Data Lake Analytics", section "Putting it all together in a simple end-to-end example".
Great question! I'll be interested to see what Mr Rys responds with.
Apologies, but this is only half an answer.
My first thoughts are to partition an ADL table using your key value. But then I'm not sure how you'd deal with the separate outputs if a potential WHERE clause isn't deterministic. Maybe CROSS JOIN in every result and .... pass!
It would be nice to have a WHILE loop with some dynamic code!
Check out this post on the MS forums that talks about dynamic input datasets. Just as an FYI.
- Exactly! With a cursor or while/for loop it is an easy task, but without iterations it is unusable for many scenarios... – peterko Mar 07 '17 at 11:07
- I wouldn't go as far as saying unusable. U-SQL is a declarative language that allows you to massively parallelise your workloads in a flexible/scalable way. Iterating over data is slow, and if you need to do this you're probably using the wrong tool... It's not for Oracle cursors! :-) – Paul Andrew Mar 07 '17 at 11:11
- As I understand it, in Data Lake (without Data Factory) it is not possible to orchestrate a rowset into files partitioned by key value. Or am I wrong? – peterko Mar 07 '17 at 11:30
- I was also thinking of some sort of "file splitter" based on data using Data Factory - did you find a solution? – Rodney Jun 24 '18 at 23:41
- Would also like to know if there is a solution to output 1 file per record? – Alex Gordon Sep 16 '19 at 21:08