Below is a JOIN that I have done. I believe it's not efficient. My u-sql job is taking a long time on this, it's been over six hours. I have lots of data too. My input is 2 GB (millions of folders) LHS & RHS in example below.
I need to find the sub directory count, depth wise where in my example the root is adl://x.azuredatalakestore.net/
Final outcome expected is below, where each level in the directory has the sum of all child directories.
| WorkDir| Depth | Dir | NumberOfSubDirectories | IsDirectory|
| -----------------------------------------------------------------------------------------|
| / | 0 |adl://x.azuredatalakestore.net/ | 5 | True |
| / | 1 |adl://x.azuredatalakestore.net/backup/ | 4 | True |
| / | 2 |adl://x.azuredatalakestore.net/backup/jenkins/ | 0 | True |
| / | 2 |adl://x.azuredatalakestore.net/backup/viewer/ | 2 | True |
| / | 3 |adl://x.azuredatalakestore.net/backup/viewer/2018/ | 1 | True |
| / | 4 |adl://x.azuredatalakestore.net/backup/viewer/2018/07/ | 0 | True |
Input for join. Left Hand side (LHS) and Right Hand Side (RHS) is same table
WorkDir | Depth | Dir | IsDirectory | |
---|---|---|---|---|
/ | 0 | adl://x.azuredatalakestore.net/ | TRUE | |
/ | 1 | adl://x.azuredatalakestore.net/backup/ | TRUE | |
/ | 2 | adl://x.azuredatalakestore.net/backup/jenkins/ | TRUE | |
/ | 2 | adl://x.azuredatalakestore.net/backup/viewer/ | TRUE | |
/ | 3 | adl://x.azuredatalakestore.net/backup/viewer/2018/ | TRUE | |
/ | 4 | adl://x.azuredatalakestore.net/backup/viewer/2018/07/ | TRUE |
My Join statement
// Get Sub Directory from Directory list
@DirWithSubDir =
SELECT
b.WorkDir AS WorkDirD,
b.Depth AS DepthD,
b.Dir AS DirD,
b.IsDirectory AS IsDirectoryD,
COUNT(b.Dir)-1 AS NumberOfSubDirectoriesD
FROM @stream_information AS a
JOIN
@stream_information AS b
ON a.IsDirectory == b.IsDirectory
WHERE a.Dir.Contains(b.Dir)
GROUP BY b.WorkDir,
b.Depth,
b.Dir,
b.IsDirectory
;
I have another join statement where I get the data in bytes of the folders, it's similar to this, that is slow too.