
I have a huge amount of data (>10000000 items) of type int, and with every new item I want to calculate the median (so I will have >10000000 medians). Should I maintain a sorted list and insert each item into it in order, then calculate the median each time, or should I insert and then sort the list each time?

Also, would a std::vector be an appropriate data structure for this, or would another data structure give better complexity?

Note: I can't use std::set since there might be duplicates. Also, if I use std::multiset, finding the median will increase complexity, since I would have to loop from the beginning to the middle to get its value.

Aya Abdelsalam
  • If your edit was a response to my answer, then I must not have explained my solution clearly enough; it does *not* involve iterating from the beginning of the multiset to the middle. Should I edit my answer? – Beta Jul 13 '15 at 04:36
  • @Beta No, I edited it before I saw your answer; I didn't think of an iterator pointing to the median. Your solution is brilliant, thank you – Aya Abdelsalam Jul 13 '15 at 04:52
  • @samgak Yes it is, but it's huge, and with each line I need to get the median, so that won't be very efficient – Aya Abdelsalam Jul 13 '15 at 04:54

1 Answer


I would use std::multiset, since it can handle duplicates and maintains sorted order automatically. I would insert the numbers one by one, maintaining an iterator pointing to the median (stepping forward or backward depending on whether the new element is greater or less than the median).

Note that if this gets too large to hold comfortably in memory, you can pack many of the highest and lowest elements into files; it's unlikely that the median will ever move that far, and if it does you can unpack and repack.

Beta