
I have a few thousand video files in my Blob Storage, which I have set up as a datastore. This blob storage receives new files every night, and I need to split the data and register each split as a new version of an AzureML Dataset.

This is how I do the data split: simply getting the blob paths and splitting them.

from pathlib import Path
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(AZ_CONN_STR, 'keymoments-clips')
blobs = container_client.list_blobs(name_starts_with='soccer')
blobs = map(lambda x: Path(x['name']), blobs)
train_set, test_set = get_train_test(blobs, 0.75, 3, class_subset={'goal', 'hitWoodwork', 'penalty', 'redCard', 'contentiousRefereeDecision'})
valid_set, test_set = split_data(test_set, 0.5, 3)

train_set, test_set, and valid_set are just n×2 numpy arrays containing the blob storage path and the class label.
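For context, a hypothetical illustration of the array layout (the paths and labels below are made up):

train_set[:2]
# array([['soccer/goal/clip_0001.mp4', 'goal'],
#        ['soccer/penalty/clip_0042.mp4', 'penalty']], dtype=object)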

Here is where I try to create a new version of my Dataset:

from azureml.core import Dataset, Datastore

datastore = Datastore.get(workspace, 'clips_datastore')

dataset_train = Dataset.File.from_files([(datastore, b) for b, _ in train_set[:4]], validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)

How is it possible that the Dataset creation seems to hang for an indefinite time, even with only 4 paths? I saw in the docs that providing a list of Tuple[datastore, path] is perfectly fine. Do you know why?

Thanks

3nomis
  • what version of the SDK are you using? The first thing people will ask is whether you're on the newest version of the SDK. Try `pip list | grep "azureml"` to get a list to share. `azureml-dataprep` is the most important library version to share here – Anders Swanson Jul 28 '21 at 14:40
  • also, how big is each file on average? Does it work when you exclude the `partition_format` param? Does it work when you pass a single file? – Anders Swanson Jul 28 '21 at 14:41
  • @AndersSwanson my azureml versions are `azureml-dataprep 2.18.0, azureml-dataprep-native 36.0.0, azureml-dataprep-rslex 1.16.1`. On average each file is 5 MB, as they are videos; when I pass a single file it still takes a long time. I am not sure why the size matters, though, as I guess it's just metadata handling, don't you agree? `partition_format` has no influence. – 3nomis Jul 28 '21 at 16:19
  • @3nomis did you manage to sort this out? I am also trying to register a dataset already present in the datastore with the Python SDK and dataset instantiation never ends. – jarandaf Sep 16 '21 at 12:51
  • @jarandaf Unluckily not. The Azure SDK seems to behave completely randomly: sometimes it is fast and sometimes it never ends. – 3nomis Sep 20 '21 at 07:53
  • I'm seeing the same "sometimes it is fast and sometimes it doesn't run" behavior. I ran into this issue shortly after uploading files to a datastore using FileDatasetFactory. Trying to create a dataset from the files on the datastore using azureml.core Dataset.File.from_files() would hang indefinitely, and then after a while began to create the dataset in less than a second. After repeated attempts, eventually the from_files() command would work nearly instantly. Creating a new folder on the same storage account and trying to create a local dataset using from_files() again worked instantly – Ryan Cole Oct 09 '22 at 11:49

3 Answers


Do you have your Azure Machine Learning Workspace and your Azure Storage Account in different Azure regions? If so, latency may be a contributing factor with validate=True.
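As a quick check (a minimal sketch; workspace.location and the datastore attributes are standard azureml-core properties, and the CLI command is just one way to look up the account's region):

from azureml.core import Datastore, Workspace

workspace = Workspace.from_config()
print(workspace.location)  # region of the AzureML workspace, e.g. 'westeurope'

datastore = Datastore.get(workspace, 'clips_datastore')
print(datastore.account_name, datastore.container_name)
# compare with the storage account's region, e.g. via the Azure CLI:
#   az storage account show -n <account_name> --query location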

ynpandey
  • No, they are in the same region, and even with `validate=False` there seems to be no success. I am not sure whether, under the hood, it downloads or transfers all the files somewhere else; that would be useless. – 3nomis Aug 05 '21 at 09:34

I'd be interested to see what happens if you run the dataset creation code twice in the same notebook/script. Is it faster the second time? I ask because it might be an issue with the .NET Core runtime startup (which would only happen the first time you run the code).

EDIT 9/16/21

While it doesn't seem to make sense that .NET Core is invoked when no data is moving, I suspect it is the validate=True param that requires all the data to be inspected (which can be computationally expensive). I'd be interested to see what happens if that param is False.
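For example, a minimal timing sketch (the same paths as in the question; only validate is changed):

import time
from azureml.core import Dataset

t0 = time.time()
dataset_train = Dataset.File.from_files(
    [(datastore, b) for b, _ in train_set[:4]],
    validate=False,  # skip checking that data can be loaded from the dataset
)
print(f"from_files took {time.time() - t0:.1f} s")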

Anders Swanson
  • I'm not completely sure, but I think what you are saying is close to true. I tried with only one file and waited "forever" for it to finish. Then I ran it with all the training data and it went fast. When I ran it another time it went "forever" again. The API or .NET runtime seems to behave quite randomly. – 3nomis Aug 05 '21 at 09:59
  • interesting!!! I'd be interested to know if re-installing .NET Core might help? That's where the issue seems to be. I was told that the .NET Core runtime can take 30 seconds to spin up initially, but that the second execution should run right away. This maps to your experience as well. What's curious (and I had no idea about this) is that perhaps the runtime shuts back down after a while? – Anders Swanson Aug 05 '21 at 16:27
  • @AndersSwanson I don't follow. Are you saying that in order to register a dataset (already present in a datastore) with the Python SDK I must have .NET Core running somewhere? I find this crazy. I am also experiencing this "forever" behaviour, and it looks to me like the SDK is trying to download all the data before the dataset registration (complete nonsense). – jarandaf Sep 16 '21 at 12:49
  • @jarandaf yeah, I agree in principle that dataset registration shouldn't require a lot of work. I just updated my answer to clarify my "thinking" (read: hypothesis) – Anders Swanson Sep 16 '21 at 16:33
  • @AndersSwanson thank you for the update. In my case I tried to register a dataset with the Dataset.Tabular.from_parquet_files API (and validate=False) and it hangs no matter what. Using the Dataset.File.from_files API worked fine. Something weird is happening under the hood. – jarandaf Sep 17 '21 at 06:15
  • @jarandaf are your ML workspace and your datastore in the same region?? – Anders Swanson Sep 17 '21 at 17:14
  • @AndersSwanson yep – jarandaf Sep 18 '21 at 18:15

Another possibility may be slowness in the way datastore paths are resolved. This is an area where improvements are being worked on.

As an experiment, could you try creating the dataset using a URL instead of a datastore? Let us know if that makes a difference to performance, and whether it can unblock your current issue in the short term.

Something like this:

dataset_train = Dataset.File.from_files(path="https://bloburl/**/*.mp4?accesstoken", validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
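For a non-public container, the access token part of that URL would be a SAS token. A sketch of one way to build such a URL with azure-storage-blob v12 (the account name and key are placeholders):

from datetime import datetime, timedelta
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

sas = generate_container_sas(
    account_name='<account>',            # placeholder
    container_name='keymoments-clips',
    account_key='<account-key>',         # placeholder
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=2),
)
url = f"https://<account>.blob.core.windows.net/keymoments-clips/**/*.mp4?{sas}"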
Monica Kei
  • Thanks for the suggestion; however, the problem still remains. Have you never faced it when trying to version datasets? – 3nomis Aug 06 '21 at 07:54
  • I have not run into this problem. Are you saying this only happens when versioning? Does the slowness happen if you register a new dataset each time? – Monica Kei Aug 07 '21 at 00:32
  • No, the slowness happens when I call `Dataset.File.from_files`. It looks strange to me, as I am not expecting the framework to download or move the files. It is just metadata handling, am I correct? – 3nomis Aug 09 '21 at 07:44
  • Oh I see, I thought `.register` was the slow part. Could you try one more thing: upgrade azureml-core and azureml-dataset-runtime to the latest versions, and set `is_file=True` when calling `.from_files` (see the sketch after these comments)? Link to the latest documentation of [Dataset.File.from_files](https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py#from-files-path--validate-true--partition-format-none--is-file-false-) – Monica Kei Aug 10 '21 at 07:10
  • Another question - do you have proxies set up by any chance? – Monica Kei Aug 10 '21 at 17:41
  • No proxies whatsoever. – 3nomis Sep 22 '21 at 09:31
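Following up on the `is_file=True` suggestion in the comments above, a minimal sketch (it assumes every tuple path points to an individual .mp4 file rather than a folder or glob; the paths come from the question):

dataset_train = Dataset.File.from_files(
    [(datastore, b) for b, _ in train_set[:4]],
    validate=False,
    is_file=True,  # each path is a single file, so skip directory expansion
)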