I'm using the Azure ML Python SDK to upload files to an Azure datastore, as described here. In detail:
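from azureml.core import Dataset, Datastore, Workspace

# Connect to the workspace and look up the registered datastore.
workspace = Workspace(subs_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, datastore_name)

# Upload the local directory to the given path on the datastore.
Dataset.File.upload_directory(
    dataset_path_on_disk,
    (datastore, dataset_path_on_store),
    overwrite=True,
    show_progress=True,
)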
There are some 100K files in the directory. Each file is pretty small, approx. 0.5 MB, but uploading each of them takes more than a minute! So I must be doing something terribly wrong here.
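To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It only uses the approximate figures quoted above, and it assumes the files are uploaded strictly one after another, which may not be how the SDK actually schedules them.

```python
# Rough estimate based on the approximate figures from the question.
n_files = 100_000        # "some 100K files"
file_size_mb = 0.5       # approximate size of each file
seconds_per_file = 60    # "more than a minute" per file, so a lower bound

total_gb = n_files * file_size_mb / 1000
total_days = n_files * seconds_per_file / 86_400  # 86,400 seconds per day

print(f"total data: ~{total_gb:.0f} GB")
print(f"sequential upload time: at least ~{total_days:.0f} days")
```

In other words, about 50 GB of data would need more than two months at this rate, which is why I assume something is off.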
These files are on an Azure compute engine / virtual machine that is copying them from a common disk space, i.e. from home/azureuser/cloudfiles/code/Users/. So it's just copying files inside the Azure infrastructure. The compute engine, workspace and datastore are all in the same directory/subscription/workspace and geographic location.
The datastore type is "Azure Blob Storage".
EDIT 1: This seems to have something to do with the huge number of files in the directory.
If I create a directory with a smaller number of files (say ~100), then the upload proceeds very fast per file.
It is as if upload_directory were doing something weird after each and every file upload, say, scanning the whole directory again.
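Given that small directories upload quickly, one possible workaround is to stage the files in small temporary directories and upload those one at a time. The following is only a sketch under that assumption: `upload_in_batches`, `iter_batches` and the `upload_fn` callback are hypothetical helpers, not part of the Azure ML SDK, and I have not verified that this actually avoids the slowdown.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List


def iter_batches(paths: List[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of `paths` with at most `batch_size` items."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def upload_in_batches(src_dir: str,
                      upload_fn: Callable[[str], None],
                      batch_size: int = 100) -> int:
    """Stage top-level files of `src_dir` in small temporary directories and
    call `upload_fn(staging_dir)` once per batch. Returns the batch count."""
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    n_batches = 0
    for batch in iter_batches(files, batch_size):
        with tempfile.TemporaryDirectory() as staging:
            for f in batch:
                target = Path(staging) / f.name
                try:
                    os.link(f, target)      # hard link: no data is copied
                except OSError:
                    shutil.copy2(f, target)  # e.g. different filesystem
            upload_fn(staging)
        n_batches += 1
    return n_batches
```

With the names from my snippet above, `upload_fn` could be something like `lambda d: Dataset.File.upload_directory(d, (datastore, dataset_path_on_store), overwrite=True, show_progress=True)`. Note that I have not checked how repeated calls to the same target path behave with `overwrite=True`, so treat this as something to try on a small sample first.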
EDIT 2:
If the directory to be uploaded has subdirectories, then things always seem to get stuck.
If there are only files in it, then around 50% of the time the upload proceeds as expected: once upload_directory tells you that "Loading N files", you're in the clear and the upload starts in a few seconds. But many times it gets stuck at the "Uploading file to .." stage.
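Since the behaviour seems to depend on whether the directory is flat, it can help to check the layout before starting an upload. This is a small diagnostic sketch using only the standard library; `describe_upload_dir` is a hypothetical helper name and it does not touch Azure at all.

```python
import os
from typing import Dict


def describe_upload_dir(root) -> Dict[str, int]:
    """Count top-level files, subdirectories (at any depth) and nested files."""
    root = os.fspath(root)
    stats = {"top_level_files": 0, "subdirectories": 0, "nested_files": 0}
    for dirpath, dirnames, filenames in os.walk(root):
        stats["subdirectories"] += len(dirnames)
        if dirpath == root:
            stats["top_level_files"] += len(filenames)
        else:
            stats["nested_files"] += len(filenames)
    return stats
```

If `subdirectories` is non-zero, the directory falls into the case where, in my experience, the upload gets stuck.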
The whole upload_directory thing seems completely unstable / unreliable (it's Microsoft, after all), so I recommend just using the local file mounts. Of course, you then can't send "batch jobs" to compute clusters, but I haven't seen anyone ever use those either, since it has all been made too complicated (again, Microsoft).
Recommendation: just do everything "interactively" in the horrible web terminal interface.