
I want to allocate more vertices to the extraction job. I tried using the ROWCOUNT hint, but it doesn't seem to work: no matter what value I use for ROWCOUNT, U-SQL always allocates the same number of vertices.

    EXTRACT xxxx FROM @"Path" USING new RndsInDataLakeCode.PyramidExtractorMerged() OPTION(ROWCOUNT=50000000);

Is there any other way to influence vertex allocation?

Thanks.

  • How many files are matched by your path? I have the impression (I have only tried AvroExtractor so far) that it is one vertex per file; there is no file splitting like Hadoop does. – Iain Mar 08 '17 at 01:55
  • This job extracts from 800 files. – lidong Mar 09 '17 at 18:03

2 Answers


Basically, the number of vertices used by EXTRACT is determined by the following:

  1. The number of files (currently at most one file per vertex) if you use file sets or the extractor requests AtomicFileProcessing=true (e.g., JSON, the current Avro extractor).
  2. The size of a file (currently 1 GB per vertex) if the file is considered splittable (AtomicFileProcessing=false, e.g., the Csv/Tsv extractors). A sketch of both cases follows this list.
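
For illustration, here is a minimal U-SQL sketch of the two cases. The paths, the schema, and the custom extractor name are hypothetical, not taken from your job:

    // Case 1: file set + atomic extractor => at most one vertex per matched file.
    // MyCode.MyJsonExtractor is a hypothetical custom extractor assumed to request
    // AtomicFileProcessing=true, so each matched file is read as a whole.
    @atomic =
        EXTRACT user string,
                amount int,
                suffix string  // virtual column filled from the {suffix} pattern
        FROM "/input/data_{suffix}.json"
        USING new MyCode.MyJsonExtractor();

    // Case 2: one large file + a splittable built-in extractor
    // (AtomicFileProcessing=false) => roughly one vertex per 1 GB of input.
    @splittable =
        EXTRACT user string,
                amount int
        FROM "/input/big_file.csv"
        USING Extractors.Csv();

    OUTPUT @atomic TO "/output/atomic.csv" USING Outputters.Csv();
    OUTPUT @splittable TO "/output/splittable.csv" USING Outputters.Csv();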

The ROWCOUNT hint only hints at the row count of the extracted result, which impacts the partitioning of subsequent operations; it does not change the number of vertices used by the EXTRACT itself.
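
A minimal sketch of where the hint actually takes effect, again assuming a hypothetical path and schema:

    // ROWCOUNT tells the optimizer to expect ~50M rows from the extract.
    // That estimate shapes how the GROUP BY below is partitioned,
    // not how many vertices the EXTRACT itself receives.
    @rows =
        EXTRACT user string,
                amount int
        FROM "/input/big_file.csv"
        USING Extractors.Csv()
        OPTION(ROWCOUNT=50000000);

    @totals =
        SELECT user,
               SUM(amount) AS total
        FROM @rows
        GROUP BY user;

    OUTPUT @totals TO "/output/totals.csv" USING Outputters.Csv();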

The Analytics Units allocation mentioned by Omid then determines the actual degree of parallelism, i.e., how many of the determined vertices can run at the same time (so over-specifying the Analytics Units will NOT make your code parallelize more).

Why do you want to increase the scale-out on the extraction?

Michael Rys
  • Thanks Michael. Currently it takes 30 minutes to extract 800 files using 60 vertices; I want to speed up by using more vertices. That's why I want to increase the scale-out. – lidong Mar 09 '17 at 17:16
  • You should probably use the fast file set preview feature. Contact me by email for the setting. – Michael Rys Mar 13 '17 at 17:47

How many ADLUs did you specify when submitting the job? This determines the maximum number of vertices that can run in parallel at any one time, and it makes the biggest difference in the parallelism of extracts. As long as the files can be split by rows, U-SQL will break files into smaller pieces and parallelize execution. If the file is in a binary format (e.g., compressed) or JSON, it has to be processed on a single vertex, since these formats cannot be split directly.

Note that the number of ADLUs you specify will be reserved for the duration of the job, and you'll be charged for them, so you'll want to balance a faster extract against the cost of reserving those ADLUs for the entire job.

OmidA