
I am trying to read several thousand hours of wav files in Python and get their durations. This essentially requires opening each wav file, getting the number of frames, and factoring in the sampling rate. Below is the code for that:

import wave


def wav_duration(file_name):
    # Duration in seconds = number of frames / sampling rate
    wv = wave.open(file_name, 'r')
    nframes = wv.getnframes()
    samp_rate = wv.getframerate()
    duration = nframes / samp_rate
    wv.close()
    return duration


def build_datum(wav_file):
    # Key into all_labels: the last three path components, minus the ".wav" extension
    key = "/".join(wav_file.split('/')[-3:])[:-4]
    try:
        datum = {"wav_file" : wav_file,
                 "labels"   : all_labels[key],
                 "duration" : wav_duration(wav_file)}

        return datum
    except KeyError:
        return "key_error"
    except:                  # any failure inside wave.open / wav_duration
        return "wav_error"

Doing this sequentially will take too long. My understanding was that multi-threading should help here since it is essentially an IO task. Hence, I do just that:

import concurrent.futures
import time

all_wav_files = all_wav_files[:1000000]
data, key_errors, wav_errors = list(), list(), list()

start = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    # submit jobs and get the mapping from futures to wav_file
    future2wav = {executor.submit(build_datum, wav_file): wav_file for wav_file in all_wav_files}
    for future in concurrent.futures.as_completed(future2wav):
        wav_file = future2wav[future]
        try:
            datum = future.result()
            if datum == "key_error":
                key_errors.append(wav_file)
            elif datum == "wav_error":
                wav_errors.append(wav_file)
            else:
                data.append(datum)
        except:
            print("Generated exception from thread processing: {}".format(wav_file))

print("Time : {}".format(time.time() - start))

To my dismay, however, I get the following results (in seconds):

Num threads | 100k wavs | 1M wavs
1           | 4.5       | 39.5
2           | 6.8       | 54.77
10          | 9.5       | 64.14
100         | 9.07      | 68.55

Is this expected? Is this a CPU-intensive task? Will multi-processing help? How can I speed things up? I am reading files from a local drive and this is running in a Jupyter notebook. Python 3.5.

EDIT: I am aware of the GIL. I just assumed that opening and closing a file is essentially IO. People's analyses have shown that in IO-bound cases, it might be counterproductive to use multi-processing. Hence I decided to use multi-threading instead.

I guess the question now is: Is this task IO bound?

EDIT EDIT: For those wondering, I think it was CPU bound (a core was maxing out at 100%). The lesson here is to not make assumptions about the task and to check it for yourself.
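
For anyone wanting to check this themselves, one quick sketch (sample_files here is a hypothetical small subset of all_wav_files): wave.open only reads header chunks, so compare the cost of just touching the header bytes on disk against the full wav_duration call. If the gap is large, the time is going into Python-level parsing (CPU) rather than disk IO.

import time

# sample_files: a hypothetical small subset of all_wav_files
start = time.time()
for f in sample_files:
    with open(f, 'rb') as fh:
        fh.read(44)              # roughly the size of a canonical wav header
io_only = time.time() - start

start = time.time()
for f in sample_files:
    wav_duration(f)              # header IO plus wave-module parsing
io_plus_parse = time.time() - start

print("header reads: {:.2f}s | wav_duration: {:.2f}s".format(io_only, io_plus_parse))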

Y91
  • Keep in mind that if you are reading from a traditional (spinning) hard drive, reading from multiple files at once might make things slower. In particular, on a traditional/spinning hard drive, the drive heads can take a (relatively) long time to seek from one distance-from-the-center-of-the-drive to another, and reading from multiple files in parallel can force the drive heads to seek back and forth more than they would if they were just reading a single (contiguous) file at a time. – Jeremy Friesner Jul 04 '18 at 02:22
  • It's not the right kind of IO task if you're reading from disk. – Mad Physicist Jul 04 '18 at 02:23
  • @MadPhysicist can you please elaborate on that? – Y91 Jul 04 '18 at 02:32
  • Python threads are asynchronous but not concurrent. That helps when your IO operation is concurrent, like a network request, and is a huge hassle, if not an outright bottleneck, when it is not, as with spinning disks. – Mad Physicist Jul 04 '18 at 03:02

1 Answer


Some things to check by category:

Code

  • How efficient is `wave.open`? Is it loading the entire file into memory when it could simply be reading header information?
  • Why is `max_workers` set to 1?
  • Have you tried using `cProfile` or even `timeit` to get an idea of which particular part of the code is taking the most time? (See the sketch below.)
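
For the last point, a minimal profiling sketch (assuming wav_duration is defined in the same session; sample_files is a hypothetical small list of wav paths):

import cProfile
import pstats

# Profile a batch of wav_duration calls
profiler = cProfile.Profile()
profiler.enable()
for f in sample_files:
    wav_duration(f)
profiler.disable()

# Print the 10 costliest entries by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)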

Hardware

Re-run your existing setup while monitoring hard disk activity, memory usage and CPU load to confirm that hardware is not your limiting factor. If you see your hard disk running at maximum IO, your memory filling up or all CPU cores at 100%, one of those could be at its limit.
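
One way to do that monitoring from Python (a sketch, assuming the third-party psutil package is installed; iostat, top or a system resource monitor work just as well):

import psutil

# Poll per-core CPU and memory once per second while the job runs
# (e.g. from another terminal or notebook cell). A single core pinned
# near 100% points at a CPU/GIL bottleneck rather than disk IO.
for _ in range(10):
    per_core = psutil.cpu_percent(interval=1, percpu=True)
    mem_used = psutil.virtual_memory().percent
    print("CPU per core: {} | memory used: {}%".format(per_core, mem_used))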

Global Interpreter Lock (GIL)

If there are no obvious hardware limitations, you are most likely running into problems with Python's Global Interpreter Lock (GIL), as described well in this answer. This behavior is to be expected if your code is effectively limited to running on a single core, i.e. there is no real parallelism across the running threads. In that case, I'd most certainly change to multiprocessing, starting with one process per CPU core, run that, and then compare the hardware monitoring results with the previous run.
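
A minimal sketch of that change, reusing build_datum and all_wav_files from the question (the chunksize value is an illustrative choice; note that ProcessPoolExecutor needs build_datum to be picklable, so in a Jupyter notebook it may have to live in an importable module):

import concurrent.futures
import os

# One worker process per CPU core; chunksize batches files per task to
# reduce inter-process overhead. map() preserves input order.
with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
    results = list(executor.map(build_datum, all_wav_files, chunksize=1000))

data = [r for r in results if isinstance(r, dict)]
key_errors = [w for w, r in zip(all_wav_files, results) if r == "key_error"]
wav_errors = [w for w, r in zip(all_wav_files, results) if r == "wav_error"]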

QA Collective
  • Yes, I am aware of the GIL. I just assumed that opening and closing a file is essentially IO. [People's analyses](https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b) have shown that in IO-bound cases, it might be counterproductive to use multi-processing. Hence I decided to use multi-threading instead. I guess the question now is: Is this task IO bound? – Y91 Jul 04 '18 at 02:16
  • Thanks! I think it was CPU bound (a core was maxing out at 100%). The lesson here is to not make assumptions about the task and to check it for yourself. I am accepting this answer. – Y91 Jul 04 '18 at 02:28
  • Upon further review of your code, I think the problem may be that `max_workers` is set to 1. If you remove that, Python will set it to the number of cores x 5: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor – QA Collective Jul 04 '18 at 02:34
  • I did change the max_workers to various values to do my experiments (look at the table) – Y91 Jul 04 '18 at 02:52
  • Oh, I see now, yes. Glad you found the limitation. I always go for multiprocessing by default these days, multi-threading just seems too quirky. – QA Collective Jul 04 '18 at 02:56