
I am working on an assignment for school simulating a line of students and multiple windows open at the Registrar's Office. We have to compute stats such as mean, median, longest wait time, etc., to display at the end of the simulation.

Someone suggested to me that I create another doubly linked list to put students in when they are "done" at the window in order to compute these stats.

Isn't it more efficient to just keep track of those things (except for median) while the loop/program is running? Then for the median I could just create an array of the student wait times and sort the whole array after the student line has been gone through.
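For concreteness, the end-of-simulation calculation I have in mind would be something like this (names are made up, and I'm assuming the wait times end up in a vector):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Running totals handle the mean; the median needs one sort at the end.
double meanWait(const std::vector<double>& waits) {
    double sum = 0;
    for (double w : waits) sum += w;
    return sum / waits.size();
}

double medianWait(std::vector<double> waits) {  // taken by value: the sort
    std::sort(waits.begin(), waits.end());      // doesn't disturb the caller
    std::size_t n = waits.size();
    return n % 2 ? waits[n / 2]
                 : (waits[n / 2 - 1] + waits[n / 2]) / 2.0;
}
```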

Or is it more efficient/better practice to do another doubly linked list for the students after they are done at the window and then just use that to compute the stats at the end?

Melissa
  • Unless your school has several million students, it is unlikely that performance will be a concern for your use case. That said, there are few performance problems that aren't made worse by using a linked list. – Wintermute Mar 29 '15 at 17:01
  • I understand but I am more in the mind set of for future careers and would like to learn the best way while I'm doing these assignments. So you are suggesting to use a doubly linked list in order to have the least performance problems? – Melissa Mar 29 '15 at 17:09
  • No. It is not generally possible to say that any container will guarantee a performance loss or gain in all circumstances, but the circumstances under which a linked list is likely to give a performance gain are rare and special (dancing links in backtracking algorithms are an example). Modern computers are heavily optimized for sequential memory access, and the memory layout of a linked list is very bad in that respect; this usually makes a linked list much slower to use in practice than a vector or deque. Keep in mind that such intuitions are not a substitute for benchmarking. – Wintermute Mar 29 '15 at 17:17
  • Ok. Then I'll use an array. – Melissa Mar 29 '15 at 17:20

1 Answer


There are two performance considerations here:

What is the best container type to use in your situation?

I'm going to compare your choice of linked list with a vector, which is my recommendation for this scenario.

You're talking about storing a number of entries and then, at some point, iterating over them all to calculate some stats. Using a linked list of any sort is an immediate concern where iteration is involved. The problem with linked lists is that their memory is non-contiguous, so while you iterate the items you suffer a much higher rate of cache misses; see What Every Programmer Should Know About Memory for a very detailed explanation. In short: if the memory is not already loaded into the cache, you have to wait for it to load, which takes a significant amount of time and is an easy win for algorithm-level optimisations. Vectors, on the other hand, are contiguous, so iterating them incurs the lowest possible cache-miss rate. This of course comes at a cost, with the trade-off usually being insertions.
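The contiguity that makes vectors cache-friendly is actually a guarantee of the standard: element `i` lives exactly `i` slots past element `0`, i.e. `&v[i] == v.data() + i`, whereas a list node can land anywhere on the heap. A small illustrative check:

```cpp
#include <cstddef>
#include <vector>

// Returns true if every element sits exactly where contiguity demands.
// (It always will for std::vector; the point is that no such relation
// holds between the nodes of a std::list.)
bool isContiguous(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (&v[i] != v.data() + i) return false;
    return true;
}
```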

Linked lists are much better for scenarios where you expect to do a lot of insertions, removals or moves of entries within the data set, since these operations just swap some pointers and carry on. For example, in an alphabetised list built by inserting AAA first and then CCC, a new entry BBB must go before CCC: a list relinks two pointers, while a vector has to move every element between the insertion point and the end one place over. If you always inserted at the beginning of a vector, you would shift every element on every insertion! Nonetheless, you didn't state needing sorted data, so I'm assuming this won't be an issue.
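The alphabetised-list case above could be sketched like this (the function name is illustrative); note that once the position is found, the `insert` itself is O(1) for a list:

```cpp
#include <list>
#include <string>

// Insert a name into an already-sorted list, keeping it sorted.
// Finding the spot is a linear walk either way; the insertion itself
// only relinks two pointers, with no elements moved.
std::list<std::string> insertSorted(std::list<std::string> names,
                                    const std::string& name) {
    auto it = names.begin();
    while (it != names.end() && *it < name) ++it;
    names.insert(it, name);  // O(1) once the position is known
    return names;
}
```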

In addition to moving data on insertion, because vectors guarantee contiguous memory, when they reach the end of their current allocation they must reallocate and copy their entire contents to a new, larger allocation. If you know in advance how many entries there will be, you can avoid this issue by calling reserve(n), which allocates enough memory to store n entries up front. Failing that, typical implementations grow the capacity geometrically (e.g. doubling it each time), so reallocations become rarer as the vector grows and the amortised cost per insertion stays constant.
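A quick sketch of the reserve() idea (the student count and wait values here are made up):

```cpp
#include <vector>

// If the number of students is known up front, reserve() performs a single
// allocation; without it the vector may reallocate and copy several times
// as it grows past its capacity.
std::vector<double> collectWaits(int expectedStudents) {
    std::vector<double> waits;
    waits.reserve(expectedStudents);   // one allocation for the whole run
    for (int i = 0; i < expectedStudents; ++i)
        waits.push_back(i * 0.5);      // placeholder wait times
    return waits;
}
```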

Lazy or Eager Evaluation

The question you asked was whether you should update your stats with each entry or compute them at the end, but there is a third option worth considering: compute them only when they are asked for! This is a question of evaluation strategy. In your case I would definitely recommend a fully or partially lazy evaluation: unless you need a running total of the mean/median, there is no point recalculating them for each new entry; only calculate when the value is needed. A simple way to do this is a "dirty" flag: set it whenever an entry is inserted or removed, and clear it once the stats have been recomputed, so repeated queries against unchanged data reuse the cached values.
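The dirty-flag approach could look something like this (class and member names are illustrative, showing just the mean):

```cpp
#include <vector>

// Lazily evaluated statistics: the mean is recomputed only when it is
// requested after the data has changed; otherwise the cached value is reused.
class WaitStats {
    std::vector<double> waits_;
    bool dirty_ = true;    // set on mutation, cleared on recomputation
    double mean_ = 0.0;
public:
    void add(double w) { waits_.push_back(w); dirty_ = true; }
    double mean() {
        if (dirty_) {
            double sum = 0;
            for (double w : waits_) sum += w;
            mean_ = waits_.empty() ? 0.0 : sum / waits_.size();
            dirty_ = false;
        }
        return mean_;
    }
};
```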

MrShiny608