1

I am trying to generate a dynamic FROM clause in U-SQL so that we can extract data from different files based on a previous query outcome. That's something like this:

@filesToExtract = SELECT whatevergeneratesthepaths from @foo; <-- this query generates a rowset with all the file we want to extract like: [/path/file1.csv, /path/file2.csv]
SELECT * FROM @filesToExtract; <-- here we want to extract the data from file1 and file2

I'm afraid that this kind of dynamics queries are not supported yet, but can someone help pointing me out the way to achieve this? It seems that the only feasible approach is to generate another U-SQL script and execute it afterwards.

Thanks in advance.

jsanchez
  • 64
  • 5

1 Answers1

2

It is not fully clear from your question if you want the file names to be dynamically retrieved and passed to an EXTRACT statement, or the name of tables/rowsets and passed to a SELECT's FROM clause. Or both.

In general, you cannot dynamically generate source names from your U-SQL expression. You may want to file a feature request here http://aka.ms/adlfeedback for dynamically or statically parameterizable sources.

Having said that, depending on your exact requirements, there may be some ways to achieve your goals without the work-around you describe.

For example, you could write your code as a parameterized table-valued function and then pass the different rowsets with different scripts, or - if you statically can decide which rowset to choose - you can use the IF statement.

Here is a pseudo-code example:

DECLARE EXTERNAL @someconditionparameter Boolean = true;

IF (@someconditionparameter) THEN
   @data = EXTRACT a int, b string FROM @fileset1 USING Extractors.Csv();
ELSE
   @data = EXTRACT a int, b string FROM @file2 USING ...;
END;

@results = MyTableValuedFunction (@data);
...

If your files are schematized differently, you may be able to use flexible column sets (currently in preview, see release notes) in the TVF to handle the variability of the rowset schema.

Michael Rys
  • 6,684
  • 15
  • 23
  • Thanks for your prompt reply. Even though it seems to me a quite interesting approach, I am afraid that it is not what I am looking for. To clarify, what I want to do is to "choose" which files I am going to extract depending on a previous query. Let's say that I have stored in database the filepaths of some files in data lake store, I would like to load a bunch of files which meet some condition (like a state or something). I hope my question is clearer now, anyways I will raise a feature request in the uservoice link you sent me before. Thanks again for your time and efforts. Regards. – jsanchez Aug 22 '17 at 08:50
  • @jsanchez I am still not sure. Do the file paths have some pattern that you could specify a fileset over? Are the states part of the filepath's name or in the content of the files? Do they have the same schema or different schemata? If for example, you have files that fit a pattern in a path and you only want to read files that match a certain path that you want to parameterize, you can use file sets and query the virtual columns. That will push the selection into the extractor and only read files that match the queries. – Michael Rys Aug 22 '17 at 17:23