
I have data in Azure Data Lake Store and I am processing it with Azure Data Lake Analytics jobs written in U-SQL. I have several CSV files which contain spatial data, similar to this:

File_20170301.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80

File_20170302.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

Each file contains data for a different day and for all longtitude-lattitude combinations.

I would like to merge all files I have and split the data so I can end up with one file for each longtitude-lattitude combination.

So, after looping through all files in my folder and appending the data for all days, I would end up with the following:

File_45_21.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50

File_48_35.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

In theory, the following should happen:

  1. Find the distinct values of the longtitude-lattitude combinations in the data.
  2. Take that array of distinct values, create a file for each combination, and extract the matching data from the source files based on the two parameters (longtitude and lattitude).

What I am struggling with is how to get started with the looping, how to extract the data based on two parameters from the source, and how to "partition" the data source by the distinct values of the combined parameters.

FeodorG

2 Answers


U-SQL does not support dynamic U-SQL directly, but it is possible to use a "script the script" technique to generate your output script. You could then run this output manually, or use something like PowerShell or Azure Data Factory to run it.

I have created a simple example based on your test data partially based on the example from here.

// Get the initial fileset
@input =
    EXTRACT longtitude float,
            lattitude float,
            date string,
            hour int,
            value1 int,
            filename string
    FROM "/input/File_201703{filename}" // file set: {filename} is a virtual column capturing the varying part of each file name
    USING Extractors.Csv();


// Add int version of the long and lat columns for grouping on
@working =
    SELECT *,
           (int) longtitude AS int_long,
           (int) lattitude AS int_lat
    FROM @input;


// Work out the filenames
@filenames =
    SELECT String.Format("File_{0}_{1}.csv", int_long, int_lat) AS outputFilename,
           int_long,
           int_lat
    FROM
    (
        SELECT int_long,
               int_lat
        FROM @working
        GROUP BY int_long,
                 int_lat
    ) AS x;


// Construct the dynamic usql and output it
@output =
    SELECT x.xsort, "@input = EXTRACT longtitude float, lattitude float, date string, hour int, value1 int, filename string FROM \"/input/File_201703{filename}\" USING Extractors.Csv();" AS usql
    FROM ( VALUES ( 10 ) ) AS x(xsort)

    UNION ALL

    SELECT x.xsort, "@working = SELECT *, (int) longtitude AS int_long, (int) lattitude AS int_lat FROM @input;" AS usql
    FROM ( VALUES ( 20 ) ) AS x(xsort)

    UNION ALL

    SELECT 30 AS xsort, String.Format("OUTPUT (SELECT * FROM @working WHERE int_long == {0} AND int_lat == {1}) TO \"/output/{2}\" USING Outputters.Csv();", int_long, int_lat, outputFilename) AS usql
    FROM @filenames;


// Select only the usql column and sort the output
@output =
    SELECT usql
    FROM @output
    ORDER BY xsort
    FETCH 100 ROWS;


OUTPUT @output
TO "/output/dynamic.usql"
USING Outputters.Text(delimiter : ' ', quoting : false);
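
For the sample data above, the generated dynamic.usql should contain something like the following (a sketch, assuming the File_45_21.csv / File_48_35.csv naming convention from the question; you would run it as a second job to produce the per-combination files):

@input = EXTRACT longtitude float, lattitude float, date string, hour int, value1 int, filename string FROM "/input/File_201703{filename}" USING Extractors.Csv();
@working = SELECT *, (int) longtitude AS int_long, (int) lattitude AS int_lat FROM @input;
OUTPUT (SELECT * FROM @working WHERE int_long == 45 AND int_lat == 21) TO "/output/File_45_21.csv" USING Outputters.Csv();
OUTPUT (SELECT * FROM @working WHERE int_long == 48 AND int_lat == 35) TO "/output/File_48_35.csv" USING Outputters.Csv();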
wBob
  • Hi @MichaelRys, do you see some problems with the dynamic approach above? Can it be improved or is it a bad idea? Any feedback gratefully received : ) – wBob Apr 06 '17 at 08:09
  • While I haven't run it, it looks ok. I may have merged the filename creation and the dynamic U-SQL creation for performance reasons. You basically use the write U-SQL with U-SQL approach I suggested above. – Michael Rys Apr 07 '17 at 00:33
  • I am just wondering if it is a good idea to take the data in a database catalog, partition it and write an export to csv file. Would this be a viable solution? USQL can still be used to import the data in a table and to export the data to csv. – FeodorG Apr 12 '17 at 12:00
  • Yes you could do that. My question would be what does that add (eg does it improve performance, maintenance or scalability etc)? For me, I add ADLA tables when I can prove they make a valuable difference to the items I list above. If you're talking about ordinary database tables ( eg Azure SQL DB, Azure SQL Data Warehouse) then ADLA can't write to them natively so you'd need a few extra steps - you could rig up something with Azure Data Factory to do this. I would weigh up the added complexity versus the benefit : ) – wBob Apr 12 '17 at 12:50
  • @wBob: Thank you so much! – FeodorG Apr 12 '17 at 13:26
  • Feel free to mark as answer and upvote if that's been useful : ) – wBob Apr 12 '17 at 14:37

I would recommend looking at the notion of File Sets to operate on many files (see https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx) and at some of the suggestions for doing dynamic outputs based on values until that feature becomes available (see How do I partition a large file into files/directories using only U-SQL and certain fields in the file? as an example).
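
A minimal sketch of the file set approach, assuming all daily files share the File_ prefix (the name suffix is illustrative; the virtual column captures the varying part of each file name and can also be used in a WHERE clause to prune which files are read):

@rows =
    EXTRACT longtitude float,
            lattitude float,
            date string,
            hour int,
            value1 int,
            suffix string // virtual column filled from the file name pattern
    FROM "/input/File_{suffix}.csv"
    USING Extractors.Csv();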

Michael Rys