It depends on what other datasets you have; however, here is one approach: blindly cut one-second snippets out of your audio, then judge whether each snippet file actually contains a single spoken digit.
For each input audio file, define a one-second window, pluck it out and save it into its own file, then slide the window further into the audio file and pluck out the next snippet into its own file.
Since we want one-second clips and do not know where in the source file the digits lie, slide the window only a short distance, say 100 ms, after each snippet is saved. Each input audio file then yields a succession of overlapping snippets, each starting only 100 ms after the previous one (the example commands below use a 200 ms step). An easy way to do this is the command line tool ffmpeg:
https://ffmpeg.org/ffmpeg.html
https://ffmpeg.org/ffmpeg-utils.html#time-duration-syntax
input_audio=audio_from_your_dataset.wav
output_audio=output/aaa
# ffmpeg does not create missing directories, so make sure the output directory exists
mkdir -p "$(dirname "$output_audio")"
ffmpeg -i "$input_audio" -ss 0 -t 1 -acodec copy "${output_audio}.0.00.wav"
ffmpeg -i "$input_audio" -ss 0.20 -t 1 -acodec copy "${output_audio}.0.20.wav"
ffmpeg -i "$input_audio" -ss 0.40 -t 1 -acodec copy "${output_audio}.0.40.wav"
ffmpeg -i "$input_audio" -ss 0.60 -t 1 -acodec copy "${output_audio}.0.60.wav"
ffmpeg -i "$input_audio" -ss 0.80 -t 1 -acodec copy "${output_audio}.0.80.wav"
ffmpeg -i "$input_audio" -ss 1.00 -t 1 -acodec copy "${output_audio}.1.00.wav"
ffmpeg -i "$input_audio" -ss 1.20 -t 1 -acodec copy "${output_audio}.1.20.wav"
In the commands above, the -ss parameter sets the starting point of the snippet in seconds, so 0.60 starts 600 ms into the file, and the -t parameter sets the length of the window in seconds.
The output will be:
./output/aaa.0.00.wav
./output/aaa.0.20.wav
./output/aaa.0.40.wav
./output/aaa.0.60.wav
./output/aaa.0.80.wav
./output/aaa.1.00.wav
./output/aaa.1.20.wav
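Rather than typing each command by hand, the sliding window can be generated with a loop. The following is a minimal sketch, not a definitive implementation: the helper names `ms_to_s` and `slide_window` are hypothetical, times are kept as integer milliseconds and assumed to be multiples of 10 ms, file names are assumed to contain no double quotes, and the total duration (2200 ms here) is an assumed value passed in by hand. The function only prints the ffmpeg commands so they can be inspected first.

```shell
# Hypothetical helper: integer milliseconds -> seconds with two decimals (1200 -> 1.20).
# Assumes the value is a multiple of 10 ms.
ms_to_s() {
    printf '%d.%02d' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 10 ))"
}

# Hypothetical helper: print one ffmpeg command per window position.
# Arguments: input file, output prefix, total duration (ms), window (ms), step (ms)
slide_window() {
    in=$1; prefix=$2; total_ms=$3; win_ms=$4; step_ms=$5
    start_ms=0
    # Stop once a full window no longer fits inside the file.
    while [ "$(( start_ms + win_ms ))" -le "$total_ms" ]; do
        start=$(ms_to_s "$start_ms")
        printf 'ffmpeg -i "%s" -ss %s -t %s -acodec copy "%s.%s.wav"\n' \
            "$in" "$start" "$(ms_to_s "$win_ms")" "$prefix" "$start"
        start_ms=$(( start_ms + step_ms ))
    done
}

# Dry run: assumed 2.2 second file, 1 second window, 200 ms step.
slide_window audio_from_your_dataset.wav output/aaa 2200 1000 200
```

Once the printed commands look right, they can be piped into `sh` to actually run them.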
Issue the commands above on the command line. This is not limited to WAV; other codecs are OK too. You now have several one-second snippet files plucked from the same input audio.
I would then wrap the above process in a meta process which varies the width of your window. Nothing set in stone says it must be 1 second, so repeat all of the above for window widths ranging from, say, 0.1 seconds to 1 second. This multiplies the number of snippet files you generate. Bonus points if you add another outermost loop that varies how far the window's starting point is slid each time, since the 100 ms step should also be a free variable. So your code should define three for loops around your ffmpeg calls: one to advance across the input file, one to vary the window width, and one to vary the window slide.
ffmpeg is the industry-standard Swiss Army knife for audio/video manipulation (along with SoX). In addition to its command line tools, ffmpeg is also a set of libraries (libavcodec, libavformat and others) that can be called from many languages (Python, Go, ...) through bindings.
Now apply some ML to measure how closely each of these snippets matches what a known spoken digit sounds like, and use that to decide which snippets to keep and which to discard.
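The three loops could be sketched as follows. This is only an illustration under stated assumptions: the function name `sweep_windows`, the particular window widths (250, 500, 1000 ms) and steps (100, 200 ms), and the output naming scheme are all hypothetical choices, and the total duration is again supplied by hand as an integer number of milliseconds. As before, the commands are printed rather than executed.

```shell
# Hypothetical helper: integer milliseconds -> seconds with two decimals (250 -> 0.25).
# Assumes the value is a multiple of 10 ms.
ms_to_s() {
    printf '%d.%02d' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 10 ))"
}

# Hypothetical helper: print ffmpeg commands for every window width, step and position.
# Arguments: input file, output prefix, total duration (ms)
sweep_windows() {
    in=$1; prefix=$2; total_ms=$3
    for win_ms in 250 500 1000; do            # loop 1: vary window width (assumed values)
        for step_ms in 100 200; do            # loop 2: vary window slide (assumed values)
            start_ms=0
            # loop 3: advance across the input file while a full window still fits
            while [ "$(( start_ms + win_ms ))" -le "$total_ms" ]; do
                # Width and step go into the file name so snippets do not overwrite each other.
                printf 'ffmpeg -i "%s" -ss %s -t %s -acodec copy "%s.w%dms.s%dms.%s.wav"\n' \
                    "$in" "$(ms_to_s "$start_ms")" "$(ms_to_s "$win_ms")" \
                    "$prefix" "$win_ms" "$step_ms" "$(ms_to_s "$start_ms")"
                start_ms=$(( start_ms + step_ms ))
            done
        done
    done
}

# Dry run on an assumed 1 second input file.
sweep_windows audio_from_your_dataset.wav output/aaa 1000
```

Note that different step sizes revisit some of the same starting points for a given window width, so some of the generated snippets contain identical audio under different names.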