0

Below is a JOIN that I have done. I believe it's not efficient. My u-sql job is taking a long time on this, it's been over six hours. I have lots of data too. My input is 2 GB (millions of folders) LHS & RHS in example below.

I need to find the sub directory count, depth wise where in my example the root is adl://x.azuredatalakestore.net/

Final outcome expected is below, where each level in the directory has the sum of all child directories.

| WorkDir| Depth  | Dir                                                  | NumberOfSubDirectories | IsDirectory|
|       -----------------------------------------------------------------------------------------|
| /      | 0      |adl://x.azuredatalakestore.net/                       | 5                      |  True      |
| /      | 1      |adl://x.azuredatalakestore.net/backup/                | 4                      |  True      |
| /      | 2      |adl://x.azuredatalakestore.net/backup/jenkins/        | 0                      |  True      |
| /      | 2      |adl://x.azuredatalakestore.net/backup/viewer/         | 2                      |  True      |
| /      | 3      |adl://x.azuredatalakestore.net/backup/viewer/2018/    | 1                      |  True      |
| /      | 4      |adl://x.azuredatalakestore.net/backup/viewer/2018/07/ | 0                      |  True      |

Input for join. Left Hand side (LHS) and Right Hand Side (RHS) is same table

WorkDir Depth Dir IsDirectory
/ 0 adl://x.azuredatalakestore.net/ TRUE
/ 1 adl://x.azuredatalakestore.net/backup/ TRUE
/ 2 adl://x.azuredatalakestore.net/backup/jenkins/ TRUE
/ 2 adl://x.azuredatalakestore.net/backup/viewer/ TRUE
/ 3 adl://x.azuredatalakestore.net/backup/viewer/2018/ TRUE
/ 4 adl://x.azuredatalakestore.net/backup/viewer/2018/07/ TRUE

My Join statement

          // Get Sub Directory from Directory list  
@DirWithSubDir =
    SELECT 
           b.WorkDir AS WorkDirD,
           b.Depth AS DepthD,
           b.Dir AS DirD,
           b.IsDirectory AS IsDirectoryD,
           COUNT(b.Dir)-1 AS NumberOfSubDirectoriesD 
    FROM @stream_information AS a
         JOIN
             @stream_information AS b
         ON a.IsDirectory == b.IsDirectory
    WHERE a.Dir.Contains(b.Dir)
    GROUP BY b.WorkDir,
           b.Depth,
           b.Dir,
           b.IsDirectory
     ;

I have another join statement where I get the data in bytes of the folders, it's similar to this, that is slow too.

halfer
  • 19,824
  • 17
  • 99
  • 186
AnnaDaKhokha
  • 59
  • 1
  • 1
  • 5
  • ADLS Gen 1 is being retired in Feb 2024. Although that may seem a way off, we're nearly in 2022 so if you have a chance to change course, eg refactor in Databricks, Azure Synapse Analytics or other engine, you should take it. See [this post](https://stackoverflow.com/a/66700263/1527504) for more details. U-SQL does not have as many turnkey settings for tuning (although there are some hints you can apply) and I think your fundamental issue is having so many files (millions). Think about using another technology and/or consolidating your files into bigger ones. – wBob Nov 16 '21 at 11:33
  • Thanks @wBob . You are right better to Migrate to ADLS Gen2 from ADLS gen1 ASAP . Microsoft has a nice solution to Migrate to ADLS Gen2 https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-migrate-gen1-to-gen2-azure-portal – AnnaDaKhokha Nov 22 '21 at 17:30

1 Answers1

0

We can make combination of all Directories from each line and then do an inner join on the same recordset. The "Dir.Contains(b.Dir)" join was a cross join and hence very expensive.

// make Dir Array 
SELECT  
        *,
        SqlArray.Create(Dir.Split('/')) AS DirArray
FROM @rowset
;


//
//  for each stream "adl://x.azuredatalakestore.net/a/b/" we explode this to :-
//  { "adl://x.azuredatalakestore.net/" , "adl://x.azuredatalakestore.net/a/","adl://x.azuredatalakestore.net/a/b/" }

@subDirExplode =
SELECT SubPath,StreamName
FROM @stream_information_Dir
    CROSS APPLY   
          EXPLODE( Enumerable.Range(3, DirArray.Count() - 3).// for each in array where each dir combination exists
                                                            Select(n=> string.Join("/", DirArray.Take(n)) + "/")) AS r(SubPath) // get all elements 3 to count and make cobinations  
   ;

// count all combination as the number of sub directories 
@dir_with_count =
SELECT 
        l.StreamName AS StreamName,
        COUNT(r.SubPath) -1  AS NumberOfSubDirectories
FROM @stream_information_Dir AS l
     LEFT OUTER JOIN
         @subDirExplode AS r
     ON l.StreamName == r.SubPath
GROUP BY 
         l.StreamName
   ;
halfer
  • 19,824
  • 17
  • 99
  • 186
AnnaDaKhokha
  • 59
  • 1
  • 1
  • 5