How to output multiple files dynamically using U-SQL custom Extractors?

Question

I have created a U-SQL query which reads the data from multiple files. Now I have to output the data in multiple files in some other folder. I have the below example,

U-SQL Output in Azure Data Lake

Used the below script,

        DECLARE @storagePath string = @"E:\";
        DECLARE @inputFileSetName string = @"dbo.file{*}.tsv"; 
        DECLARE @outputFileName string = @"dbo.files.csv";

        DECLARE @input string = String.Concat(@storagePath, 
        @inputFileSetName);
        DECLARE @output = string.Concat(@storagePath, @outputFileName);
        @searchlog =
        EXTRACT Id string,
        Name string,
        Address string

        FROM @input
        USING new USQLApplication3.SampleExtractor(Encoding.UTF8);
        @transactions =
        SELECT *,
        ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Id DESC) AS 
        RowNumber_12345
        FROM @searchlog;

       @result =
       SELECT EXTRACT Id,
        Name,
        Address 
       FROM @transactions
       WHERE RowNumber_12345 == 1;

       OUTPUT @result 
       TO @output
       USING Outputters.Csv(encoding : Encoding.UTF8, quoting : false);

Below is the C# code for custom Extractors,

public class SampleExtractor : IExtractor
{
    private Encoding _encoding;
    private byte[] _row_delim;
    private char _col_delim;

    public SampleExtractor(Encoding encoding, string row_delim = "\n", char col_delim = '\t')
    {
        this._encoding = ((encoding == null) ? Encoding.UTF8 : encoding);
        this._row_delim = this._encoding.GetBytes(row_delim);
        this._col_delim = col_delim;
    }

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        string line;
        //Read the input line by line
        foreach (Stream current in input.Split(_encoding.GetBytes("\n")))
        {
            using (StreamReader streamReader = new StreamReader(current, this._encoding))
            {
                line = streamReader.ReadToEnd().Trim();
                //Split the input by the column delimiter
                string[] parts = line.Split(this._col_delim);
                foreach (string part in parts)
                {
                    string res = part;
                    if (res != null)
                    {
                        res = "\"" + res.Replace("\"", "\"\"") + "\"";
                    }
                    output.Set<string>(count, res);
                }

            }
            yield return output.AsReadOnly();

        }
       yield break;
    }
}

How can I dynamically create multiple files on output? As of now, it is creating only a single file.

Edited:

I have also tried with the multiple output in a single u-sql script,

      @x = SELECT * FROM (VALUES( "A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

     // Generate the script to do partitioned output based on name column:
     DECLARE @out string ="E:/genscript.usql";
     @stmts = 
         SELECT "@res=SELECT value1, value2 FROM @x WHERE name == 
         \""+name+"\"; 
     OUTPUT 
          @res TO \"E:/"+name+".csv\" USING Outputters.Csv();" AS output 
          FROM (SELECT DISTINCT name FROM @x) AS x;

     OUTPUT @stmts TO @out
           USING Outputters.Text(delimiter:' ', quoting:false);

     @path = EXTRACT path string FROM @out USING Extractors.Text() ;
     OUTPUT @path  TO "E:/{*}.usql" 
           USING Outputters.Text(delimiter:' ', quoting:false);

But it is showing the following error,

Severity Code Description Project File Line Suppression State Error E_CSC_USER_READFROMPREVIOUSOUTPUT: The EXTRACT statement reads/references from location E:\genscript.usql that is the target of a previous OUTPUT statement. Description: Compiler does not support OUTPUT to and then reading/referencing from the same file in a single job. Resolution: Please make one job to OUTPUT to the file and a different job to read/reference the same file.

What should I do now to output multiple files from a single u-sql or how I access another u-sql script file from current u-sql script?

you probably need to create a JOB which does only 1 record at a time. then parameterize it and for every record you will have an output — Alex Gordon, Sep 09 '19 at 19:07

wBob · Answer 1 · 2017-07-20T15:00:30.717

2

You can simply use multiple OUTPUT commands, for example

OUTPUT @result1
TO @output1
USING Outputters.Csv(encoding : Encoding.UTF8, quoting : false);

OUTPUT @result2
TO @output2
USING Outputters.Csv(encoding : Encoding.UTF8, quoting : false);
...

U-SQL does not support dynamic U-SQL directly but there are ways of doing it. Two example ways of doing it are here and here.

edited Jul 20 '17 at 15:00

answered Jul 19 '17 at 21:09

wBob

13,710
3
20
37

2

Hi,I have to create it dynamically according to the number of rows in the input file. – Arron Jul 20 '17 at 12:58
In the First Sample , https://stackoverflow.com/questions/43227421/u-sql-split-a-csv-file-to-multiple-files-based-on-distinct-values-in-file/43243301#43243301 – Arron Jul 21 '17 at 10:25
they are not given the "/output/dynamic.usql" query – Arron Jul 21 '17 at 10:25
Then how it would process the data – Arron Jul 21 '17 at 10:26
Arron: The scripts in the example are creating another U-SQL script. So you would orchestrate your execution by first running the generation script, then download the generated script (e.g., dynamic.usql) and then run that script. Note that the feature to output to multiple files is still under development. – Michael Rys Aug 18 '17 at 00:29

score 1 · Answer 2 · answered Mar 05 '20 at 16:20

As of 2020, this is released behind a Feature Preview flag DataPartitionedOutput:on

SET @@FeaturePreviews = "DataPartitionedOutput:on";


@rows =
EXTRACT 
Firstname string,
Lastname string,
Score string
FROM "/input/a.csv"
USING Extractors.Csv(skipFirstNRows:1);


OUTPUT @rows
TO @"output/{Score}.csv"
USING Outputters.Csv();

How to output multiple files dynamically using U-SQL custom Extractors?

2 Answers2