I am trying to use an AmlCompute instance to preprocess my data. To do so, I need to be able to write the processed data back to the datastore. I am taking this approach because the cluster shuts down automatically when the job completes, so I can let it run until it is done without paying for more time than is needed.

The problem is when I try to write back to the datastore (which is mounted as a dataset) I get the following error:

OSError: [Errno 30] Read-only file system: '/mnt/batch/tasks/shared/LS_root/jobs/[...]/wav_test'

I have set the access policy for my datastore to allow read, add, create, write, delete, and list, but I don't think that is the issue because I can already write to the datastore from the Microsoft Azure File Explorer.

Is there a way to mount a datastore directly or through a dataset with write privileges from the azureml python sdk?

Alternatively, is there a better way to preprocess this (audio) data on azure for machine learning?

Thanks!

EDIT: I'm adding an example that illustrates the problem.

from azureml.core import Workspace, Dataset, Datastore
import os

ws = Workspace.from_config()
ds = Dataset.get_by_name(ws, name='birdsongs_alldata')

mount_context = ds.mount()
mount_context.start()

os.listdir(mount_context.mount_point)

output:

['audio_10sec', 'mp3', 'npy', 'resources', 'wav']

So the file system is mounted and visible.

# try to write to the mounted file system
outfile = os.path.join(mount_context.mount_point, 'test.txt')

with open(outfile, 'w') as f:
    f.write('test')

Error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-1b15714faded> in <module>
      1 outfile = os.path.join(mount_context.mount_point, 'test.txt')
      2 
----> 3 with open(outfile, 'w') as f:
      4     f.write('test')

OSError: [Errno 30] Read-only file system: '/tmp/tmp8ltgsx6x/test.txt'
B. Bogart

1 Answer

I've simulated the same scenario in my environment and it has worked. Could you please share the code and the full error message in the question?

Regarding the cost concerns, you can use the Azure ML Python SDK to start the compute, wait for it to reach the running state, and stop it again, using azureml.core.compute. This gives you more control over how long the compute is running (start, execute, stop).
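
The start, execute, stop pattern could be wrapped in a context manager so that the compute is stopped even if the preprocessing fails. This is only a minimal sketch: it assumes a compute object that exposes start(wait_for_completion=...) and stop(wait_for_completion=...), as azureml.core.compute.ComputeInstance does. The running helper and the FakeCompute stand-in are hypothetical names used for illustration, so the example runs without an Azure workspace.

```python
from contextlib import contextmanager


@contextmanager
def running(compute):
    """Start `compute`, yield it, and always stop it afterwards."""
    compute.start(wait_for_completion=True)
    try:
        yield compute
    finally:
        # Runs even if the body raises, so the compute is stopped either way.
        compute.stop(wait_for_completion=True)


class FakeCompute:
    """Hypothetical stand-in that only records the calls made on it."""

    def __init__(self):
        self.calls = []

    def start(self, wait_for_completion=False):
        self.calls.append("start")

    def stop(self, wait_for_completion=False):
        self.calls.append("stop")


fake = FakeCompute()
with running(fake) as compute:
    compute.calls.append("preprocess")  # the real work would go here
print(fake.calls)
```

With the real SDK, the object passed to running would be the compute instance retrieved from the workspace instead of FakeCompute.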

The best way to preprocess audio files depends somewhat on their content. If the audio contains speech, I strongly recommend using Azure Cognitive Services - Speech API (speech-to-text).

If it's not voice, you can use the wave module, like in the code below:

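import numpy
from wave import open as open_wave

filename = '<filename>'  # replace with the path to a 16-bit PCM WAV file
waveFile = open_wave(filename, 'rb')
nframes = waveFile.getnframes()
wavFrames = waveFile.readframes(nframes)
waveFile.close()
# numpy.fromstring is deprecated for binary data; frombuffer replaces it
ys = numpy.frombuffer(wavFrames, dtype=numpy.int16)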

Credits

This method is not specific to Azure, but it will let you work with the data in a structured way.
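
To try the wave approach without any real recordings, the sketch below writes a short synthetic 16-bit mono tone to a temporary file and reads the samples back. It uses only the standard library (array in place of numpy). The file name, sample rate, and tone frequency are arbitrary choices made for illustration.

```python
import array
import math
import os
import sys
import tempfile
import wave

RATE = 8000  # samples per second (arbitrary)


def write_tone(path, freq=440.0, seconds=0.1):
    """Write a 16-bit mono sine tone to `path` and return the samples."""
    n = int(RATE * seconds)
    samples = array.array(
        "h",
        (int(16383 * math.sin(2 * math.pi * freq * i / RATE)) for i in range(n)),
    )
    data = array.array("h", samples)  # copy, so the caller's samples stay untouched
    if sys.byteorder == "big":
        data.byteswap()  # WAV files store little-endian samples
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wf.setframerate(RATE)
        wf.writeframes(data.tobytes())
    return samples


def read_samples(path):
    """Read all frames of a 16-bit mono WAV file into an array of ints."""
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples


path = os.path.join(tempfile.mkdtemp(), "tone.wav")
written = write_tone(path)
print(len(read_samples(path)), read_samples(path) == written)
```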

Daniel Labbe
  • I've added a code example to the question that includes the full error. I suspect it has to do with mounting a dataset rather than a datastore, but I can't find any documentation on mounting a datastore with the azureml python sdk. Thanks! – B. Bogart Apr 14 '21 at 23:12
  • It seems to be on the mount, indeed. Could you try to save the file locally, in the scope of the execution, and upload it via datastore.upload_files()? – Daniel Labbe Apr 15 '21 at 09:19
  • That's what I have been doing, but upload_files doesn't always upload all the files, and it's time-consuming. That's why I am looking for a way to write to the mount directly. – B. Bogart Apr 15 '21 at 13:28
  • Have you checked the OutputFileDatasetConfig Class? https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig?view=azure-ml-py – Daniel Labbe Apr 15 '21 at 21:19
  • TBH, I don't think that this was the original purpose of this method, but it might work for you. Here is an example: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-with-datasets#where-to-write-training-output – Daniel Labbe Apr 15 '21 at 21:20
  • This looks very promising! I will try it and report back. Thanks. – B. Bogart Apr 18 '21 at 00:37
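
The OutputFileDatasetConfig suggestion from the comments could look roughly like the sketch below. It is not a confirmed solution from this thread. It assumes the azureml-core SDK (v1) is installed, that the preprocessing lives in a script called preprocess.py (a hypothetical name) taking an input and an output directory, and that the default datastore and the path processed/ are acceptable placeholders for the real destination. The idea is that the input dataset stays mounted read-only while the output location is mounted writable, and whatever the script writes there is uploaded to the datastore. The azureml imports sit inside build_run_config so the local part can be tried without a workspace.

```python
import os
import shutil


def preprocess(input_dir, output_dir):
    """Copy every .wav file to output_dir (placeholder for the real processing)."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        if name.lower().endswith(".wav"):
            shutil.copyfile(os.path.join(input_dir, name),
                            os.path.join(output_dir, name))
            written.append(name)
    return written


def main(argv):
    """Script side: on the cluster, both arguments arrive as mount paths."""
    input_dir, output_dir = argv[1], argv[2]
    print(preprocess(input_dir, output_dir))


def build_run_config(ws, dataset_name, compute_name):
    """Submission side (assumed setup): read-only input, writable output."""
    from azureml.core import Dataset, ScriptRunConfig
    from azureml.data import OutputFileDatasetConfig

    ds = Dataset.get_by_name(ws, name=dataset_name)
    output = OutputFileDatasetConfig(
        name="processed",
        # Placeholder destination: default datastore, folder 'processed/'.
        destination=(ws.get_default_datastore(), "processed/"),
    )
    return ScriptRunConfig(
        source_directory=".",
        script="preprocess.py",  # hypothetical script that calls main(sys.argv)
        arguments=[ds.as_named_input("raw").as_mount(), output.as_mount()],
        compute_target=compute_name,
    )
```

The returned configuration would then be submitted through an Experiment in the usual way. Locally, preprocess can be called on two ordinary directories to check the logic before paying for cluster time.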