If I need to read lots of files, will it be faster if I break the task into multiple threads?

Question

I recently had an interview with NetApp for a C++ role (they do big data storage systems). I wrote some code to answer an interview question. Their response was “You failed”. It was very difficult to get feedback, as it usually is after failing an interview. After some very polite begging for feedback I got a little bit. But it still didn’t quite make sense.

Here’s the problem:

Given a bunch of files in a directory, read them all and count the words. Create a bunch of threads to read the files in parallel.
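
For concreteness, here is a minimal sketch of the kind of program the problem describes. It is not the code from the repository mentioned in this question. The names `count_words_in_file` and `count_words_in_dir` are hypothetical, and it assumes C++17 (`std::filesystem`) and that a "word" is any whitespace-separated token.

```cpp
#include <atomic>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

// Count whitespace-separated tokens in a single file.
// An unreadable file simply contributes zero words.
std::size_t count_words_in_file(const fs::path& path) {
    std::ifstream in(path);
    std::size_t count = 0;
    std::string word;
    while (in >> word) ++count;
    return count;
}

// Count words in every regular file directly inside `dir`,
// using `num_threads` worker threads (hypothetical interface).
std::size_t count_words_in_dir(const fs::path& dir, unsigned num_threads) {
    std::vector<fs::path> files;
    for (const auto& entry : fs::directory_iterator(dir))
        if (entry.is_regular_file()) files.push_back(entry.path());

    std::atomic<std::size_t> next{0};   // index of the next unclaimed file
    std::atomic<std::size_t> total{0};  // sum over all workers

    auto worker = [&] {
        std::size_t local = 0;
        // Each worker claims one file at a time until none are left.
        for (std::size_t i = next.fetch_add(1); i < files.size();
             i = next.fetch_add(1)) {
            local += count_words_in_file(files[i]);
        }
        total += local;  // one atomic update per worker, not per word
    };

    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return total.load();
}
```

Calling `count_words_in_dir("some/dir", 4)` would count with four workers. Because workers claim files from a shared counter, a slow file does not stall the others, but whether more workers help at all is exactly the open question here. On some platforms this needs to be linked with `-pthread`.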

The consensus at NetApp (people who know a lot about storage) is that it should get faster with more threads. I think in most circumstances you are so I/O bound that it will get slower after 1 or 2 threads. I just don’t see how it’s possible to get faster unless you are under some known special circumstances (like a SAN or maybe RAID arrays). Even in those cases, the number of sequential channels to the disk saturates, and you are I/O bound again after only a few threads.

I think my code was great (of course). I’ve been writing C++ for many years. I think I know some things about what makes good code. It should have passed on style alone. Hehe. As a general rule, performance optimizations are not something you should guess at; they should be tested and measured. I only had limited time to run experiments. But now I’m curious.

The code is in my GitHub account here:

https://github.com/MenaceSan/CountTextWords

Anyone have any opinions on this? Shed some light on what they might have been thinking? Any other criticisms of the code?

I base part of my opinion on this:

Does multithreading make sense for IO-bound operations?

Sometimes you are really smarter than the people who are interviewing you for a job. I agree with your assessment that this is an I/O bound situation. A few threads might eke out a little bit of extra performance, but I would not bother to go beyond that. — Sam Varshavchik, Dec 21 '18 at 02:03
It is IO bound, but you also do some real work in the thread after it reads a block into memory (parsing the text), so you can get some performance gain there. Also, the files may already be cached in memory, in which case you would get much better utilization. It also depends on the underlying fs and device. So it makes sense to use a few threads there. — Serge, Dec 21 '18 at 03:47
True, so 2 threads might get some benefit. The CPU work could be done on one thread while the other is sitting in I/O wait. The I/O-bound side seems like it would always take longer than the CPU part of the work, so with more than 2 threads you are back to the same problem where the extra threads are just wasted. But nothing is 100%. On the right hardware it might actually benefit from a few more threads. I assume they didn't want me to write a self-balancing routine. In an interview? Really? Hehe. I was going for clean code. — Menace, Dec 21 '18 at 15:08

score 0 · Answer 1 · answered Dec 21 '18 at 04:12

The answer is, as you have surmised, it depends a lot on the conditions of the task. And also as you say, you can't know until you actually test.
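
To make "actually test" concrete, here is a minimal timing sketch in C++. It assumes you already have some callable that does the whole job for a given thread count; `sweep_thread_counts` and the `task` parameter are hypothetical names for illustration, not part of any library.

```cpp
#include <chrono>
#include <functional>
#include <utility>
#include <vector>

// Run `task(n)` once for each thread count in `counts` and record the
// wall-clock time in seconds. `task` is assumed to perform the complete
// job (e.g. read and count every file) using `n` threads.
std::vector<std::pair<unsigned, double>> sweep_thread_counts(
    const std::function<void(unsigned)>& task,
    const std::vector<unsigned>& counts) {
    std::vector<std::pair<unsigned, double>> results;
    for (unsigned n : counts) {
        const auto start = std::chrono::steady_clock::now();
        task(n);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        results.emplace_back(n, elapsed.count());
    }
    return results;
}
```

One caveat when interpreting the numbers: after the first run the operating system has probably cached the files in memory, so later runs may measure cached reads rather than the storage device. Repeating the sweep in a different order, or on a data set larger than RAM, helps separate the two cases.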

That said, this was an interview with a big data storage provider. They may have wanted you to either assume the task was talking about a system you would be writing for them (i.e. large amounts of very fast network-based storage), or at the least tell them what your assumptions for the task were. Furthermore they may have wanted you to talk about things like whether file size and number of files mattered and how it would affect things. (And all the other factors - amount of memory on the computer doing the reading, speed of CPU doing the processing, etc.)

You say:

The consensus at NetApp (people who know a lot about storage) is that it should get faster with more threads.

Did they tell you this during your interview? If so, it may be because that's the experience they have with their hardware and software stack. If it was someone from HR who told you this after the interview, I would probably take it with a grain of salt. When engineers try to communicate this type of information to HR, it usually turns into a game of telephone, passing through one or more managers before it gets to the person you talked to, and their understanding of what was said may not match yours or the engineer's.

When in doubt in an interview, explain what your assumptions are, verify that the interviewers share them, and if not, adjust them to match what they're asking you. They may be making ridiculous assumptions to see what you come up with, or they may just have different experiences than you have.

FWIW, it sounds like you have a reasonable idea of the challenges of this task, at least for a typical machine configuration that someone like me uses every day. I wouldn't have dinged you for that if you had explained that's what you were assuming. But not everyone doing interviewing thinks the same way. Sorry you didn't get the job, but from the sound of it, you'll find one soon enough!
